SECTION 1


Introduction 

This  report  describes  progress  made  by  our  research  group  over  the  period 
September  1983  to  September  1986.  Although  we  explored  a  range  of  issues  in 
connectionist learning, the major focus was the study of learning nonlinear associative
mappings by layered networks. In earlier research we obtained some preliminary
results with an approach to this problem based on reinforcement learning [10,3,15].
However, the specific reinforcement learning rules used in these studies did not produce
rapid and reliable learning in all the learning tasks we tried. During the period
reported  on  here,  we  set  out  to  obtain  better  understanding  of  this  class  of  methods 
through  computer  simulation  and  mathematical  analysis. 

The  research  direction  that  proved  most  fruitful  was  our  effort  to  develop  rigorous 
ties  between  our  reinforcement-learning  adaptive  units  and  the  theory  of  stochastic 
learning  automata.  Our  initial  aim  was  to  develop  a  theoretically  tractable  learning 
rule  by  developing  one  that  specialized,  under  one  set  of  restrictions,  to  a  familiar 
supervised-learning  rule  while  also  specializing,  under  another  set  of  restrictions,  to 
one  of  the  simplest  of  the  stochastic  learning  automaton  algorithms.  The  result  is  a 
learning rule that we call the Associative Reward-Penalty, or $A_{R-P}$, learning rule. It
is very closely related to a relatively little-known learning rule presented by Widrow,
Gupta, and Maitra [58] that they called the “selective bootstrap adaptation” rule.
Thus, although it is novel, the $A_{R-P}$ rule is closely connected to existing theory, and
we were able to prove a convergence theorem for a single adaptive unit implementing
the $A_{R-P}$ rule (we call such a unit an $A_{R-P}$ unit) [7]. What was rather surprising was
that the $A_{R-P}$ unit turned out to perform very well as a network component. Layered
networks of $A_{R-P}$ units solve nonlinear associative learning problems with great
reliability. This surprised us because in devising the $A_{R-P}$ unit we were concerned
solely with the mathematical tractability of a single unit and not with network performance.
However, it later became apparent that the learning capabilities that single
$A_{R-P}$ units provably possess are crucial in obtaining reliable learning in networks.
Consequently,  as  a  result  of  our  attempts  to  shore  up  the  mathematical  foundations 
of  our  work,  we  developed  an  adaptive  unit  that  performed  much  better  in  networks 
than  any  of  the  others  we  had  tried. 

During the period covered in this report, a number of other research groups became
interested in the problem of learning nonlinear associative mappings by layered
networks, or more generally, the problem of learning by “hidden units.” In addition
to our own method using $A_{R-P}$ units, two new methods were developed: the
Boltzmann learning procedure of Ackley, Hinton, and Sejnowski [1] and the error
back-propagation method of Rumelhart, Hinton, and Williams [44]. These methods
attracted much public attention, especially the back-propagation method, which
Sejnowski and Rosenberg [45] used in a system called NETtalk that learns how to
convert text to speech. Unlike Boltzmann learning, which applies to symmetrically
connected networks, the error back-propagation method applies to networks without
cycles (acyclic networks). Consequently, error back-propagation is more directly
comparable to the $A_{R-P}$ method than is Boltzmann learning.

We  invested  much  effort  in  performing  simulations  to  compare  various  methods 
for learning in layered networks, including the error back-propagation method, and
the results are reported here. In the comparisons, we included methods that represent
several different approaches, including the most brute-force search method possible.
We chose a learning task that was hard enough to make the brute-force search
inefficient but not so hard that enormous amounts of CPU time were required. On
this task, the 6-input multiplexer task (see Section 4), the error back-propagation
method proved to be the fastest, with a modified $A_{R-P}$ method coming second and
the unmodified $A_{R-P}$ method third. We did not systematically apply these methods
to a series of increasingly difficult learning tasks in order to assess how they
“scale” to larger problems. Our experience and theoretical understanding suggest,
however, that the ordering of performance observed on the multiplexer task would
be preserved on more difficult tasks. The comparative simulations do establish that
both the error back-propagation and the $A_{R-P}$ methods are very much better than
a variety of more conventional search methods.


Although we have not extended the $A_{R-P}$ convergence theorem, which applies to
a single adaptive unit, to a network of $A_{R-P}$ units, much theoretical insight into the
behavior of $A_{R-P}$ networks has been provided by a result proved by R. Williams (one
of the developers of the error back-propagation method). Williams [61,62] has shown
that under certain restrictions on the $A_{R-P}$ rule, the expected change of any weight
within an arbitrary acyclic network of $A_{R-P}$ units is proportional to the gradient of
the probability of reward for the entire network with respect to that weight. This
result means that $A_{R-P}$ networks do something similar to what error back-propagation
networks do, but they use estimates of the gradient that can be determined without
the need for explicit back-propagation.

Because $A_{R-P}$ networks estimate a gradient, the following modified training
procedure suggests itself. Instead of updating weights after a single presentation of an
input pattern and the generation of a single activity pattern, one can hold the input
pattern constant for several time steps and accumulate a gradient estimate during
the generation of several activity patterns. Updating the weights on the basis of this
improved gradient estimate should improve the learning rate. We report the results of
simulations designed to test this hypothesis in Section 4.

Also  reported  here  are  results  obtained  from  applications  of  layered-network 
methods to two different tasks requiring the learning of problem-solving strategies.
The first task is the pole-balancing task that we have used in the past to demonstrate
reinforcement learning under conditions of delayed reinforcement [13]. The
second task is to learn how to solve the Tower of Hanoi puzzle using a method that
is essentially the same as the method used in learning to balance the pole. Our earlier
work with the pole-balancing problem assumed the existence of a representation for
the system’s state consisting of a large number of non-overlapping “boxes” produced
by a pre-existing decoder. Given this representation, the task became one of filling in
look-up tables. This simplified representation allowed us to separate representation
issues from the issues of temporal credit-assignment. In the studies reported here,
the pre-existing decoder is replaced by a layered adaptive network. This network
receives as input a vector of four real numbers giving the state of the cart/pole system.
The network has to learn how to represent the state so that the system as a
whole can successfully avoid failure. The layered network provides a kind of adaptive
decoder. In order to accomplish this, the adaptive critic element and the associative
search element of previous studies were combined with the error back-propagation
method for learning in layered networks. The resulting system was able to learn
appropriate mappings for the control actions and the internal evaluation, and it was
demonstrated that the multilayer system dramatically outperformed a single-layer
system.

Much  the  same  approach  was  taken  with  the  Tower  of  Hanoi  puzzle.  The  state 
of  the  puzzle  was  represented  as  a  binary  vector  that  acted  as  input  to  two  layered 
networks, one of which was responsible for forming an informative evaluation function,
and the other of which was responsible for forming the correct mapping from
puzzle states to actions (moving the disks). This system consistently learned to solve
the puzzle using the minimum number of moves. This example allowed us to discuss
the relationship between our strategy learning methods and an adaptive production
system that has been applied to this puzzle [31].

In  the  concluding  section  of  this  report,  I  place  our  results  in  perspective  by 
discussing  their  relationship  to  more  conventional  engineering  methods.  I  also  discuss 
directions  in  which  I  think  it  will  be  profitable  to  continue  the  development  of  these 
methods. 


SECTION  2 


The Associative Reward-Penalty Unit


We developed a learning rule that we call the associative reward-penalty, or $A_{R-P}$,
rule [7,6,9,8]. This rule, which can be implemented by a neuron-like adaptive unit
that we call an $A_{R-P}$ unit, is a refinement of similar learning rules that we had
studied earlier. We devised it by combining aspects of algorithms for stochastic
learning automata with aspects of algorithms for pattern classification or system
identification. As a result of this hybrid nature, this method differs in critical ways
from the methods, such as the perceptron and Widrow/Hoff LMS methods, that have
become widely used in connectionist systems (for details, see Ref. [6]). I first give an
informal description of the $A_{R-P}$ learning rule, after which I specify it more formally
and define the task it was devised to solve.

The $A_{R-P}$ rule is an embellishment of Thorndike’s [52] “Law of Effect”:

Of several responses made to the same situation, those which are accompanied
or closely followed by satisfaction to the animal will, other
things being equal, be more firmly connected with the situation, so that,
when it recurs, they will be more likely to recur; those which are accompanied
or closely followed by discomfort to the animal will, other things
being equal, have their connections with that situation weakened, so that,
when it recurs, they will be less likely to occur. The greater the satisfaction
or discomfort, the greater the strengthening or weakening of the
bond. (p. 244)

Although a literal interpretation of this “law” has numerous difficulties with respect
to animal learning data, it remains a principle whose basic features have considerable,
but not uncontested, validity [33]. The $A_{R-P}$ rule implements the basic idea of the
Law of Effect, but it was necessary to add a number of refinements in order to make
it work correctly.

Each  situation  referred  to  in  the  Law  of  Effect  corresponds  to  an  input  vector, 
or key, that is received as input by an $A_{R-P}$ unit. From this input vector, the unit
determines  an  “activation  level”  which  is  the  weighted  sum  of  the  components  of  the 
input  vector,  where  the  weights  make  up  the  unit’s  current  weight  vector.  The  unit 
then  determines  its  action  by  comparing  its  activation  level  with  a  randomly  varying 
threshold,  “firing”  (action  =  1)  when  the  activation  exceeds  the  current  threshold 
value,  and  otherwise  not  firing  (action  =  0).  The  noise  in  the  threshold  is  such  that 
when  the  activation  is  zero,  the  two  actions  are  equiprobable;  when  it  is  positive, 
firing  is  the  more  likely  action;  and  when  it  is  negative,  not  firing  is  the  more  likely 
action.  The  activation  level  therefore  determines  the  strength  of  the  bond  between 
the  situation  and  the  actions.  As  the  weights  change  so  as  to  increase  the  magnitude 
of  the  activation  for  specific  input  vectors,  the  bond  between  those  vectors  and  the 
various  actions  increases — positive  activation  producing  a  bond  between  the  input 
vector  and  firing;  negative  activation  producing  a  bond  between  the  vector  and  not 
firing. 

The $A_{R-P}$ learning rule causes the weight vector to change in such a way that
if  an  action  emitted  in  the  presence  of  situation  x  yields  an  evaluation  of  “reward,” 
the  unit  is  more  likely  to  produce  the  same  action  when  x,  or  situations  similar  to 
x,  occur  in  the  future;  in  the  case  of  penalty,  weights  change  in  such  a  way  that  the 
unit  is  more  likely  to  produce  the  other  action,  when  x,  or  situations  similar  to  x, 
occur  in  the  future.  In  order  for  this  process  to  converge  correctly  to  the  actions 
that  correspond  to  the  highest  probability  of  reward,  it  is  necessary  to  change  the 
weights  asymmetrically  in  the  cases  of  reward  and  penalty.  Changes  in  the  case  of 
penalty  must  be  much  smaller  than  the  corresponding  changes  would  be  in  the  case 
of  reward.  In  the  following  sections,  more  technical  descriptions  of  these  ideas  are 
presented. 


The  Associative  Reinforcement  Learning  Task 


The $A_{R-P}$ learning rule is designed to solve what we call associative reinforcement
learning tasks. In these tasks the learning system and its environment interact in
a closed loop. At each discrete time step, or trial, $t$, the environment provides
the learning system with a pattern vector, $x[t]$, selected from a finite set of vectors
$X = \{x^{(1)}, \ldots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$; the learning system emits an action, $y[t]$, chosen from
the finite set $Y = \{y_1, \ldots, y_k\}$; the environment receives $y[t]$ as input and sends to
the learning system a reward/penalty signal $r[t] \in \{\text{reward}, \text{penalty}\}$ that evaluates
the action $y[t]$. The environment determines the evaluation according to a map
$d : X \times Y \to [0,1]$, where $d(x,y) = \Pr\{r[t] = \text{reward} \mid x[t] = x, y[t] = y\}$. Ideally,
one wants the learning system eventually to respond to each input vector $x \in X$ with
action $y^x$ with probability 1, where $y^x$ is such that $d(x, y^x) = \max_{y \in Y} d(x,y)$.

As pointed out in Ref. [7], in the case of a single, nonzero input vector, this task
reduces to the task usually studied by learning automaton theorists (which, according
to the terminology used here, is a nonassociative reinforcement learning task); see
Section 2 and Ref. [36]. On the other hand, in the case of two actions ($|Y| = 2$)
the task reduces to a conventional formulation of supervised learning pattern
classification (see [10]) if for each $x \in X$, $d(x, y_1) + d(x, y_2) = 1$. This restriction
(assuming it is known to hold) implies that feedback received from performing one
action provides information about the other action. This makes the task much easier
and allows conventional supervised learning pattern-classification algorithms (slightly
modified) to succeed (see Ref. [7] for details).


The $A_{R-P}$ Learning Rule


The $A_{R-P}$ rule’s action selection method is parameterized at step $t$ by a weight
vector $w[t] \in \mathbb{R}^n$:

$$y[t] = \begin{cases} 1, & \text{if } w[t]^T x[t] + \eta[t] > 0; \\ 0, & \text{otherwise;} \end{cases} \tag{2.1}$$

where $w[t]^T x[t]$ is the inner product of $w[t]$ and $x[t]$, and the $\eta[t]$ are independent
identically distributed random variables, each having distribution function $\Psi$.
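
The action selection of Equation 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the logistic noise distribution that the report adopts later for its simulations, and the function names (`logistic_cdf`, `sample_logistic_noise`, `select_action`, `prob_fire`) are hypothetical, not taken from the report.

```python
import math
import random


def logistic_cdf(s, T=0.5):
    """Logistic distribution function Psi(s) = 1 / (1 + exp(-s / T))."""
    z = s / T
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically safe branch for negative arguments
    return e / (1.0 + e)


def sample_logistic_noise(rng, T=0.5):
    """Draw eta with distribution function Psi by inverting the logistic CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return T * math.log(u / (1.0 - u))


def activation(w, x):
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def select_action(w, x, rng, T=0.5):
    """Equation 2.1: fire (1) if w^T x + eta > 0, otherwise do not fire (0)."""
    return 1 if activation(w, x) + sample_logistic_noise(rng, T) > 0 else 0


def prob_fire(w, x, T=0.5):
    """Probability of action 1 given x, i.e. 1 - Psi(-w^T x) = Psi(w^T x)."""
    return logistic_cdf(activation(w, x), T)
```

With zero weights `prob_fire` returns .5 for any input, and the empirical firing frequency of `select_action` should approach `prob_fire` as the number of samples grows.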

According to Equation 2.1, the action probabilities at step $t$ are conditional on
the input vector in a manner determined by the parameter vector $w[t]$. In particular

$$p^{0x}[t] = \Pr\{y[t] = 0 \mid x[t] = x\} = \Pr\{w[t]^T x + \eta[t] \le 0\} = \Psi(-w[t]^T x), \tag{2.2}$$

and

$$p^{1x}[t] = \Pr\{y[t] = 1 \mid x[t] = x\} = 1 - p^{0x}[t]. \tag{2.3}$$

If, for example, each random variable $\eta[t]$ has zero mean, then when $w[t]^T x = 0$,
the probability that each action is emitted given input vector $x$ is .5; when $w[t]^T x$
is positive, action $y[t] = 1$ is the more likely action; and when $w[t]^T x$ is negative,
action $y[t] = 0$ is the more likely.¹ As $|w[t]^T x|$ increases for all $x \in X$, the mapping
Equation 2.1 approaches a deterministic linear discriminant function.

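The parameter vector is updated according to the following equation:

$$w[t+1] - w[t] = \begin{cases} \rho[t]\,(y[t] - p^{1x}[t])\,x[t], & \text{if } r[t] = \text{reward}; \\ \lambda\rho[t]\,(1 - y[t] - p^{1x}[t])\,x[t], & \text{if } r[t] = \text{penalty}; \end{cases} \tag{2.4}$$

where $0 \le \lambda \le 1$ and $\rho[t] > 0$.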

In the case of reward, according to Equation 2.4, $w$ changes so that the probability
of the action chosen, conditional on the current input vector, moves toward 1 (if
$y[t] = 1$ then $w$ changes so that $p^{1x}$ approaches 1; if $y[t] = 0$ then $p^{1x}$ decreases
toward 0, which means that the probability of producing action 0 increases). In the
case of penalty, on the other hand, $w$ changes so that the probability of the action
not chosen, conditional on the current input vector, moves toward 1. Note that the
parameter $\lambda$ in Equation 2.4 determines the degree of asymmetry in the magnitude
of the weight change for these two cases.
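
A single application of the weight update in Equation 2.4 might look as follows. This is a minimal sketch, assuming the logistic distribution function with parameter T that the report uses in its simulations; the function name `arp_update` and its default parameter values are illustrative choices, not the report's code.

```python
import math


def prob_fire(w, x, T=0.5):
    """p^{1x} = Psi(w^T x) for the logistic distribution function (assumed)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) / T
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def arp_update(w, x, y, rewarded, rho=0.5, lam=0.05, T=0.5):
    """Return the new weight vector after one step of Equation 2.4.

    y is the action taken (0 or 1); rewarded is True for reward,
    False for penalty; lam plays the role of lambda, rho of rho[t].
    """
    p1 = prob_fire(w, x, T)
    if rewarded:
        step = rho * (y - p1)            # move p^{1x} toward the action taken
    else:
        step = lam * rho * (1 - y - p1)  # move p^{1x} toward the other action
    return [wi + step * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    print(arp_update(w, [1.0, 1.0], y=1, rewarded=True))
    print(arp_update(w, [1.0, 1.0], y=1, rewarded=False))
```

Setting `lam=0` makes the penalty branch a no-op, which is the reward-inaction special case.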

¹ This version of the $A_{R-P}$ rule differs from that given in Refs. [7,8,6] in that the actions are 0 and
1 instead of -1 and 1. The weight-update rule (Equation 2.4) is altered so that the two versions are
exactly equivalent. The 0/1 form allows the notation to be a bit simpler.

It is shown in [7] that the $A_{R-P}$ rule reduces under various restrictions to more
conventional learning methods. It reduces to the two-action (nonassociative) linear
reward-$\epsilon$-penalty ($L_{R-\epsilon P}$) learning automaton rule [36] when each $\eta[t]$ in Equation 2.1
is uniform in the interval $[-1, 1]$, the input pattern is constant and nonzero over time
steps ($x[t] = x \neq 0$), and the initial parameter vector $w[1]$ is such that $w[1]^T x \in [-1, 1]$.
If additionally $\lambda = 0$, then the $A_{R-P}$ rule reduces to the linear reward-inaction
($L_{R-I}$) rule [36]. On the other hand, when the $A_{R-P}$ rule is made deterministic by
letting $\eta[t] = 0$ for all $t$ (i.e., the distribution function $\Psi$ is the step function), then
the $A_{R-P}$ rule becomes the perceptron learning rule [42]. With a slight modification,
the $A_{R-P}$ rule can be reduced to the pattern-classification method introduced by
Widrow and Hoff [59] (the adaline, or LMS, algorithm). Consequently, the $A_{R-P}$ rule
not only extends learning automata capabilities but also occupies the intersection of
important classes of learning algorithms. Section 2 provides some background on
learning automaton methods. The $A_{R-P}$ rule is most closely related to the “selective
bootstrap adaptation” method of Widrow, Gupta, and Maitra [58], to which it is
compared in [7].
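
The deterministic case can be checked directly from Equations 2.1 to 2.4. The following is a sketch of that substitution, assuming $\Psi$ is the unit step at zero and $w[t]^T x[t] \neq 0$ (the text does not state a convention for the tie at zero):

```latex
% With \eta[t] = 0, Equation 2.1 gives y[t] = 1 exactly when w[t]^T x[t] > 0,
% and Equations 2.2-2.3 give p^{1x}[t] = 1 - \Psi(-w[t]^T x[t]) \in \{0, 1\},
% so the action probability coincides with the action itself:
\[
  p^{1x}[t] = y[t].
\]
% Substituting into Equation 2.4:
\[
  w[t+1] - w[t] =
  \begin{cases}
    \rho[t]\,(y[t] - y[t])\,x[t] = 0, & \text{if } r[t] = \text{reward};\\[2pt]
    \lambda\rho[t]\,(1 - 2y[t])\,x[t], & \text{if } r[t] = \text{penalty}.
  \end{cases}
\]
```

Under these assumptions the weights are left unchanged on reward, and on penalty they move by $\pm\lambda\rho[t]\,x[t]$ in the direction of the action not taken, which has the form of an error-correction step.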

A convergence theorem is proven by Barto and Anandan [7] by extending to
the associative case results proven by Lakshmivarahan [28,27]. It holds under the
following conditions: (C1) the set of input vectors $X = \{x^{(1)}, \ldots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$,
is linearly independent; (C2) for each $x \in X$ and $t \ge 1$, $\Pr\{x[t] = x\} > 0$; (C3) the independent, identically
distributed random variables $\eta[t]$ in Equation 2.1 have a continuous and strictly
monotonic distribution function $\Psi$; and (C4) the sequence $\rho[t]$ in Equation 2.4 is
such that $\rho[t] > 0$, $\sum_t \rho[t] = \infty$, $\sum_t \rho[t]^2 < \infty$. We can prove the following theorem:

Theorem. Under conditions (C1)-(C4), for each $\lambda \in (0, 1]$, there exists a $w^0_\lambda \in \mathbb{R}^n$
such that the random process $\{w[t]\}_{t \ge 1}$ generated by the $A_{R-P}$ rule in an associative
reinforcement learning task converges to $w^0_\lambda$ with probability 1 (that is,
$\Pr\{\lim_{t \to \infty} w[t] = w^0_\lambda\} = 1$), where for all $x \in X$,

$$\Pr\{y = 1 \mid w^0_\lambda, x\} \begin{cases} > 1/2, & \text{if } d(x,1) > d(x,0); \\ < 1/2, & \text{if } d(x,1) < d(x,0). \end{cases}$$

In addition, for all $x \in X$,

$$\lim_{\lambda \to 0} \Pr\{y = 1 \mid w^0_\lambda, x\} = \begin{cases} 1, & \text{if } d(x,1) > d(x,0); \\ 0, & \text{if } d(x,1) < d(x,0). \end{cases}$$

According to the usual performance criteria for learning automata [36], this result
implies that for each $x \in X$, the $A_{R-P}$ rule is $\epsilon$-optimal. In fact, it implies a strong
form of $\epsilon$-optimality for each $x \in X$. It is highly unlikely that this result is the most
general that can be proved about this class of learning rules (see [7]).

As is often done when using similar pattern-classification methods, in most of
our simulations we hold $\rho[t]$ constant in order to increase learning speed even though
a weaker form of convergence in this case has not yet been proven. We have not
yet investigated elaborations of the $A_{R-P}$ rule that reduce to recursive least squares
methods based on Newton’s algorithm, but these may show
improved convergence rates. We view condition (C1), that the set of input vectors is
linearly independent, as the most serious restriction required for the present theorem.
It is likely that this restriction can be removed and a result proved that involves
some form of operator pseudoinverse.


Simulation of a Single $A_{R-P}$ Unit

In order to illustrate the performance of the $A_{R-P}$ learning rule, we describe
the results of simulating a single $A_{R-P}$ unit in a simple associative reinforcement
learning task that requires discrimination between two linearly independent, but
non-orthogonal, input vectors. We use as a measure of performance the probability
that the unit will receive reward on the average time step given its current parameter
vector. We denote this $M[t]$ when computed based on the parameter vector $w[t]$:

$$M[t] = \sum_{x \in X} \xi^x \Pr\{r[t] = \text{reward} \mid x[t] = x\},$$

where $\xi^x$ is the probability that input pattern $x$ occurs on any trial. This measure is
maximized when the optimal action for each input pattern occurs with probability
1, in which case it is

$$M_{\max} = \sum_{x \in X} \xi^x \max\{d(x,1), d(x,0)\}.$$

The distribution function of the random variables used in all the simulations described
here is the logistic distribution given by $\Psi(s) = 1/(1 + e^{-s/T})$, where $T$ is
a parameter. This is a sigmoidal function that is similar to a normal distribution
function but is easier to evaluate. It is also used in the studies of statistical cooperativity
(e.g., Refs. [24,1]), where $T$ is the “computational temperature” of the system.
As $T$ approaches zero, the distribution function approaches a step function, which
means that the $A_{R-P}$ unit more closely approximates a deterministic system. Given
this distribution function, the probability $p^{1x}[t]$ in Equation 2.4 is as follows (from
Equations 2.2 and 2.3):

$$\begin{aligned} p^{1x}[t] &= 1 - p^{0x}[t] \\ &= 1 - \Psi(-w[t]^T x) \\ &= 1 - 1/(1 + e^{w[t]^T x / T}) \\ &= \Psi(w[t]^T x). \end{aligned}$$

In all simulations presented here, we set $T = .5$.

In the first simulation the input vectors are $x^{(1)} = (1,0)^T$ and $x^{(2)} = (1,1)^T$,
which are linearly independent but not orthogonal. These vectors are equally likely
to occur on each trial ($\xi^{x^{(1)}} = \xi^{x^{(2)}} = .5$). The weight vector, $w$, is zero at the start
of each sequence of trials, which makes the actions initially equiprobable for both
input vectors. The reward probabilities implemented by the unit’s environment are
given by the following table:

    x          d(x,0)    d(x,1)
    (1,0)^T    .6        .9
    (1,1)^T    .4        .2

Thus it is optimal for the learning system to respond to $(1,0)^T$ with action 1 to obtain
reward with probability .9, and to respond to $(1,1)^T$ with action 0 to obtain reward
with probability .4. Therefore, in this task $M_{\max} = (.9 + .4)/2 = .65$, and the initial
overall reward probability is $(.6 + .9 + .4 + .2)/4 = .525$. Note that any nonassociative
learning automaton algorithm will be able to achieve a reward probability of at most
$(.9 + .2)/2 = .55$ by learning to perform the action 1 at all times. Also note that
for each input $x$, the reward probabilities are either both greater than .5 or both
less than .5, making this task considerably more difficult than one with the reward
probabilities placed above and below .5 for each $x$.
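
The three performance figures quoted for this task follow from the reward probabilities by simple averaging. The short Python check below reproduces them; the reward probabilities are the ones implied by the arithmetic in the text, and the names `D`, `XI`, `m_max`, and `m_for_policy` are hypothetical helpers introduced only for this illustration.

```python
# Reward probabilities d(x, y), keyed by input pattern and then by action,
# as implied by the arithmetic in the text.
D = {
    (1, 0): {0: 0.6, 1: 0.9},
    (1, 1): {0: 0.4, 1: 0.2},
}
# Probability that each pattern occurs on a trial.
XI = {(1, 0): 0.5, (1, 1): 0.5}


def m_max():
    """Best achievable reward probability: the optimal action for every pattern."""
    return sum(XI[x] * max(D[x].values()) for x in D)


def m_for_policy(p_fire):
    """Reward probability when action 1 is chosen with probability p_fire[x]."""
    return sum(
        XI[x] * (p_fire[x] * D[x][1] + (1.0 - p_fire[x]) * D[x][0]) for x in D
    )


if __name__ == "__main__":
    print("M_max:", m_max())
    print("initial M (equiprobable actions):", m_for_policy({x: 0.5 for x in D}))
    print("always action 1:", m_for_policy({x: 1.0 for x in D}))
    print("always action 0:", m_for_policy({x: 0.0 for x in D}))
```

Comparing the last two printed values shows why the best a pattern-blind (nonassociative) learner can do is to settle on action 1.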

Figure 1a shows results of simulating an $A_{R-P}$ unit in this task with three different
values of $\lambda$: .01, .05, and .25. We held the parameter $\rho[t]$ at the value .5 for all
$t$. Plotted for each trial $t$ is the average of $M[t]$ over 100 runs, where a run is a
sequence of 5000 trials. The dashed lines show theoretical asymptotic performance
levels for the three values of $\lambda$ (if $\rho[t]$ were decreasing according to (C4)). Note
that this asymptote approaches the optimal performance level .65 as $\lambda$ decreases
and that the learning rate decreases as $\lambda$ decreases. The average final parameter
vectors for $\lambda$ = .01, .05, and .25 are respectively $(2.99, -4.04)^T$, $(2.73, -3.08)^T$, and
$(1.91, -1.71)^T$. Figure 1b shows a plot of $M[t]$ for one of the runs contributing
to the average shown in Fig. 1a for $\lambda = .05$. Although this task involves only
two-dimensional pattern vectors, it illustrates the essential difficulties of learning to
discriminate between patterns that are similar by virtue of sharing a subset of feature
values.
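
A simulation along the lines described above can be sketched as follows. This is not the code used for the report's figures: it is a minimal reimplementation under the stated settings (logistic noise with T = .5, constant step size .5, zero initial weights, equiprobable patterns), with the reward probabilities implied by the text. The action is drawn directly with probability $\Psi(w^T x)$, which is equivalent to thresholding with logistic noise; all function names are hypothetical.

```python
import math
import random

# Input patterns and reward probabilities (d(x,0), d(x,1)) implied by the text.
X = [(1.0, 0.0), (1.0, 1.0)]
D = {X[0]: (0.6, 0.9), X[1]: (0.4, 0.2)}


def psi(s, T=0.5):
    """Logistic distribution function, written to avoid overflow."""
    z = s / T
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def performance(w, T=0.5):
    """M for weight vector w, with both patterns equally likely."""
    total = 0.0
    for x in X:
        p1 = psi(dot(w, x), T)
        total += 0.5 * ((1.0 - p1) * D[x][0] + p1 * D[x][1])
    return total


def run(trials, lam, rho=0.5, T=0.5, seed=0):
    """Simulate one run; return the final weights and the M[t] history."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    history = []
    for _ in range(trials):
        x = rng.choice(X)                  # environment presents a pattern
        p1 = psi(dot(w, x), T)
        y = 1 if rng.random() < p1 else 0  # stochastic action selection
        rewarded = rng.random() < D[x][y]  # stochastic evaluation
        if rewarded:
            step = rho * (y - p1)
        else:
            step = lam * rho * (1 - y - p1)
        w = [wi + step * xi for wi, xi in zip(w, x)]
        history.append(performance(w, T))
    return w, history


if __name__ == "__main__":
    w, history = run(trials=5000, lam=0.05)
    print("final weights:", w)
    print("mean M over last 1000 trials:", sum(history[-1000:]) / 1000)
```

Because individual runs fluctuate considerably with a constant step size, averaging `history` over many seeds gives a smoother picture of the learning curve than any single run.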

$A_{R-P}$ Units and Stochastic Learning Automata

The theory of learning automata originated with the independent work of the Soviet
cybernetician Tsetlin [55], mathematical psychologists studying learning [16,19],
and statisticians studying sequential decision problems (e.g., the “n-armed bandit
problem” [41]). Although this theory has an extensive modern literature in engineering
(reviewed in [36]), there has been very little cross-fertilization between this
theory and neural-network research. In this subsection I briefly describe this theory,
contrast it with the theory of supervised pattern classification, and describe
how learning rules like the $A_{R-P}$ rule can be seen as a synthesis of aspects of these
theories.

Figure 1: Simulation Results for a Single $A_{R-P}$ Unit.

Figure 2: Stochastic Learning Automaton Interacting with a Random Environment.

Figure 2 shows a learning automaton interacting with an environment. At each
step in the processing cycle, the automaton randomly picks an action from a set
of possible actions, $Y = \{y_1, \ldots, y_k\}$, according to a vector of action probabilities,
$P = \{p(y_1), \ldots, p(y_k)\}$. The environment then evaluates that action by selecting an
evaluation signal that it transmits back to the automaton. Figure 2 shows the case
in which the evaluation, $r$, is either “success” or “failure” and is selected according
to probabilities $\{d_1, \ldots, d_k\}$, where $d_i = \Pr\{\text{success} \mid y_i\}$ (other formulations allow
a countable number or a bounded continuum of evaluations). Upon receiving the
evaluation, the automaton updates its action probabilities as a function of its current
action probabilities, the action chosen, and the environment’s evaluation of that
action. Beginning with no knowledge of the environmental success probabilities, the
objective of the automaton is to improve its expectation of success over time. Ideally,
it should eventually choose action $y_j$ with probability 1, where $d_j = \max\{d_1, \ldots, d_k\}$.
Many different algorithms have been studied under a number of different performance
measures, and many convergence results have been proven [30].
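
As one concrete instance of this processing cycle, the sketch below implements the linear reward-inaction scheme in its usual textbook form: on success the chosen action's probability moves toward 1 and the others shrink proportionally, and on failure nothing changes. The report does not spell out this update, so the form used here, the step size `a`, and the function names are assumptions made for illustration.

```python
import random


def choose_action(p, rng):
    """Sample an action index according to the action-probability vector p."""
    r = rng.random()
    cumulative = 0.0
    for i, pi in enumerate(p):
        cumulative += pi
        if r < cumulative:
            return i
    return len(p) - 1  # guard against rounding in the cumulative sum


def lri_update(p, chosen, success, a=0.1):
    """Linear reward-inaction update of the action probabilities.

    On success: p[chosen] += a * (1 - p[chosen]); every other entry is
    scaled by (1 - a). On failure the vector is returned unchanged.
    """
    if not success:
        return list(p)
    return [
        pi + a * (1.0 - pi) if i == chosen else (1.0 - a) * pi
        for i, pi in enumerate(p)
    ]


def step(p, d, rng, a=0.1):
    """One automaton/environment cycle; d[i] is the success probability of action i."""
    action = choose_action(p, rng)
    success = rng.random() < d[action]
    return lri_update(p, action, success, a)


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.5]
    for _ in range(2000):
        p = step(p, d=[0.2, 0.8], rng=rng, a=0.05)
    print("action probabilities after 2000 steps:", p)
```

The update keeps the probabilities summing to 1, since the mass added to the chosen action equals the mass removed from the others.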

Theorists have become increasingly interested in the collective behavior of learning
automata. Figure 3 shows collections of $N$ learning automata interacting with an
environment. In Fig. 3a, each automaton receives a different evaluation signal that
depends, in general, on the actions of all $N$ automata. This models the situation
in which the automata have differing, and possibly conflicting, interests. This is a
game decision problem. In contrast to the problems studied in classical game theory,
the automata operate in total ignorance of the payoff structure of the game and the
presence of the other automata. In the case of zero-sum games (games of pure conflict),
theoretical results show that when employing certain algorithms, the learning
automata converge to the game’s solution (if that solution involves pure strategies;
see Refs. [29,30,56]).

Figure 3: (a) The game problem. (b) The team problem.

Figure 3b shows a collection of learning automata in the team situation, which is
the special case of the game situation in which the automata receive the same evaluation
signal. In this case, the automata have a common goal but each automaton only
has partial control over the evaluation. As in the case of games, the learning process
in this case is incompletely understood, but a number of mathematical results have
been proven, the strongest of which shows that certain stochastic learning automaton
algorithms lead to monotone increases in performance [37].

Comparing stochastic learning automata and the typical adaptive units used in
theoretical neural-network research reveals several important differences. First, a
typical neuron-like adaptive unit has multiple input pathways that carry patterned
stimulus information. Such a unit might also have a pathway specialized for training,
such as the pathway for the desired response of a Widrow/Hoff Adaline or a Perceptron
unit. The learning process causes the unit to implement or approximate a
desired mapping from stimulus patterns to responses. A learning automaton, on the
other hand, has only a single input pathway, for the evaluation signal. Learning
results in either the selection of a single optimal action or a suitable action probability
vector; no (nontrivial) mapping is produced. On this dimension of comparison, then,
the usual adaptive units are doing something more sophisticated than are learning
automata.


However,  the  usual  adaptive  unit  requires  an  environment  that  directly  provides 
either  a  desired  response  or  a  signed  error  that  directly  tells  the  unit  what  response 
it  should  have  produced.  In  contrast,  a  learning  automaton  has  to  discover,  in  a 
stochastic  environment,  which  action  is  best  by  sequentially  producing  actions  and 
observing  the  results.  Since  there  are  no  constraints  on  the  success  probabilities, 
information  gained  from  performing  one  action  provides  no  information  about  the 
consequences  of  the  other  actions.  This  can  be  a  non-trivial  problem  even  in  the  case 
of  two  possible  actions  and  is  fundamentally  different  from  the  supervised  learning 
problem [18]. Therefore, in terms of the amount of information required for successful
learning,  a  stochastic  learning  automaton  implements  a  form  of  learning  that  is  more 
powerful  than  the  supervised  learning  performed  by  most  neuron-like  adaptive  units. 

Because typical network adaptive units and learning automata excel on different
dimensions, it has been fruitful to study learning units that combine the capabilities
of these two types of systems. The resulting units, such as $A_{R-P}$ units, are able to
learn mappings in the absence of explicit instructional information. This ability has
implications for applications to learning control problems, as discussed in Section 5.
Units such as $A_{R-P}$ units can also participate in team or game decision problems similar
to those in which learning automata have been studied. Unlike nonassociative
learning automata, however, these units can learn to act conditionally on information
from a variety of sources, including other units in a collection. Consequently, collective
behavior more complex than that produced by nonassociative learning automata
can be produced by networks of units combining associative learning with reinforcement
learning. The following quotation illustrates that Tsetlin [55] was similarly
interested in more elaborate forms of collective behavior:

We  have  discussed  very  simple  forms  of  behavior,  and  for  this  reason 
we  limited  ourselves  to  the  simplest  types  of  automata.  The  exchange 
of  information  among  these  automata  takes  place  in  the  language  of 
penalties  and  rewards.  Although  this  language  seems  universal  enough, 
it would, however, be interesting to also look at more complicated automata
that possess some specialized language to communicate to other
automata. Such automata are needed to describe more complex forms
of behavior. These more complex behavioral forms necessitate the use of
much more diverse information. (p. 125)

To  the  best  of  my  knowledge,  there  has  been  no  systematic  attempt  to  study  the 
collective  behavior  of  learning  automata  that  communicate  in  this  way.1  Some  of 
our  research  represents  the  beginning  of  this  type  of  study  as  described  in  the  next 
section. 


1Recent work by Thathachar and Sastry [51] uses stochastic learning automata in an algorithm
for supervised pattern classification. Although this algorithm combines pattern classification and
learning automata in a very interesting manner, it does not involve mutually communicating learning
automata.

SECTION 3


Cooperative Behavior of $A_{R-P}$ Units


In this section, I present an overview of our studies of cooperating collections of
$A_{R-P}$ units. Since details of the simulations are provided elsewhere [6], I mainly discuss
the significance of these results and their relationship to other lines of research:
the collective behavior of stochastic learning automata, game and team decision theory,
and other methods for learning in layered networks. In Section 7, I briefly discuss
problems  with  scaling  this  approach  to  larger  problems  and  suggest  how  they  might 
be  solved  by  means  of  modularity  and  local  reinforcement. 


Associative  Search  Networks  and  Team  Decision  Problems 

Our early work with the networks we called associative search networks, or ASNs,
stressed the ability of these networks to learn associative mappings in the absence
of explicit instructional information [14,11]. Figure 4 shows an ASN. It differs from
the usual single-layer associative memory networks discussed in the connectionist
literature (e.g., Ref. [23]) because instead of having reference channels for specifying
desired outputs of the units, it has a single channel for broadcasting a scalar
evaluation signal to all of the network’s units. We studied ASNs in associative
reinforcement learning tasks [14,12,11]. The object of such a task is to construct the
mapping that associates each key with the action (recollection) that yields the best
possible evaluation from the environment. A basic assumption is that the network
has no a priori knowledge about the environment’s evaluation function. If a network
can solve this task, then the associative mapping it constructs has exactly the same
properties as the mappings learned by the usual associative memory networks. In this


Figure  4:  Associative  search  network. 


case,  however,  the  mapping  would  be  formed  in  the  absence  of  explicit  instructional 
information.  The  scalar  evaluation  signal  contains  much  less  information  than  the 
reference  vector  required  for  storing  information  in  the  conventional  case — it  can  be 
generated  by  an  environment  that  can  evaluate  the  behavior  of  the  network,  that  is, 
the  collective  behavior  of  the  network’s  units,  but  cannot  specify  the  desired  behavior 
of  each  individual  component. 

In  addition  to  relating  ASNs  to  associative  memory  networks,  one  can  relate  them 
to  the  teams  of  stochastic  learning  automata  mentioned  in  Subsection  2  and  shown 
in  Fig.  3.  An  ASN  in  an  associative  reinforcement  learning  task  is  a  generalization 
of  a  set  of  learning  automata  in  a  team  decision  problem.  If  one  were  to  hold  the 
input  pattern  to  the  ASN  fixed  for  all  time,  the  result  would  be  the  same  as  a  team 
of  nonassociative  learning  automata  facing  a  team  decision  problem.  Since  all  units 
receive  the  same  reinforcement,  they  have  no  conflicts  of  interest.  Consequently, 
the  ability  of  an  ASN  to  search  for  optimal  patterns  can  be  seen  to  arise  from  the 
cooperative  activity  of  the  adaptive  units  as  each  attempts  to  maximize  its  own 
performance.  The  ability  of  the  adaptive  units  of  an  ASN  to  do  this  conditionally  on 
information  provided  by  the  input  patterns  implies  that  the  units  cooperate  to  form 
associative mappings.

Layered Teams of $A_{R-P}$ Units


A natural extension of the single-layer ASN is to add further layers of
$A_{R-P}$ units. In these networks, units learn to act conditionally on information provided
by other units in the network as well as information provided by the network’s
environment. As a result, layered networks can learn to implement nonlinear associative
mappings. Suppose the network’s environment presents stimulus patterns to
the network by making the patterns’ components available as input to some subset
of the network’s units. We call the units that receive this external stimulation the
input units. The output signals of another subset of units are received by the environment,
and patterns of these signals constitute the “overt” actions of the network.
These are the output units, or to use the term of Hinton and Sejnowski [24], “visible
units.” The units that are not output units (including any input units that are not
output units) we call the “hidden units” after Hinton and Sejnowski [24].1 Suppose
that the environment evaluates the activity of the visible units and broadcasts a
reinforcement signal to all the units of the network.

How  can  a  hidden  unit  improve  its  reward  probability  when  its  output  cannot 
directly  affect  the  environment?  The  only  possibility  is  for  it  to  assist  visible  units 
in  increasing  their  reward  probabilities;  and  this  might  be  possible  only  by  assisting 
intermediate  units.  For  example,  a  hidden  unit  might  adjust  its  weights  in  order 
to  produce  a  signal  A  that  another  hidden  unit  combines  with  other  information  to 
produce  a  signal  B,  where  signal  B,  in  turn,  allows  a  visible  unit  to  make  a  required 
discrimination.  The  adaptive  units  must  be  able  to  discover  how  they  can  contribute 
to  the  common  goal.  We  regard  the  linking  up  of  units  under  these  conditions  to  be 
a  form  of  cooperation  by  which  units  coordinate  their  activities  for  mutual  benefit. 

1Note that our use of these terms differs slightly from their usage by Hinton and Sejnowski and
others. They replace each of our network input pathways with a specialized unit whose activation can
be clamped to specific values by the network’s environment, and they call these units visible units
too. I have always preferred not to do this given my background in switching and automata theory.


A Minimal Layered Network of $A_{R-P}$ Units

Figure 5 shows a network of two $A_{R-P}$ units, $u_1$ and $u_2$. Only $u_1$ receives stimulus
patterns from the environment, and only the action of $u_2$ is available to the
environment ($u_1$ is hidden; $u_2$ is visible). Suppose this network faces an associative
reinforcement learning problem in which the network’s output, the output of
$u_2$, affects the reward probability in a manner that depends on the stimulus pattern
presented to $u_1$. Both units receive the same reinforcement signal. If there were no
means for $u_1$ to communicate with $u_2$, the units would be capable of achieving only
limited reward frequencies. The action of $u_2$ influences the reinforcement of both
units, but in the absence of a communication link, $u_2$ remains blind to the discriminative
stimulus and therefore cannot learn to respond selectively in a discrimination
task. On the other hand, in the absence of a communication link, $u_1$ can sense the
discriminative stimulus but cannot influence the reinforcement received. The complementary
specialties of the two units have to be combined in order for each to
attain optimal performance. In simulating this situation, we arranged for the action
of $u_1$ to potentially influence $u_2$ by providing an interconnecting pathway with an
initial weight of zero. If this weight can be adjusted properly, the network can respond
correctly. However, the correct value of the interconnecting weight depends
on how $u_1$ has learned to respond to its input. Conversely, the correct behavior of $u_1$
depends on the value of the interconnecting weight, that is, on how $u_2$ has learned
to respond to its input signals. Thus the two units must adapt simultaneously in a
tightly-coupled cooperative fashion in order to maximize reward frequency.

Figure 5: A Minimal Layered Network of $A_{R-P}$ Units

To be more specific, we set up the simulation in the following way. Each unit is
provided with a constant input (equal to 1) to allow its threshold to vary and one
other input pathway. We regard only this second stimulus component as the stimulus
pattern $x$, treating the constant input as part of a unit’s internal mechanism. Each
unit of the network in Fig. 5 can therefore receive the input “pattern” 0 or 1, where
for $u_1$ it is generated by the network’s environment, and for $u_2$ it is the output of
$u_1$. The reward probabilities implemented by the network’s environment are given
by the following table:

x        d(x,0)   d(x,1)
0        .9       .1
1        .1       .9

Table entry $d(x, y)$ is the network reward probability given that $u_1$ receives $x$ as
input and $u_2$ responds with $y$ as output, that is, given that the network as a whole
responds to $x$ with $y$. Thus it is optimal for the network to respond to $x = 0$ with
action 0 to obtain reward with probability .9, and to respond to $x = 1$ with action 1
to obtain reward with probability .9. In this task $M_{max} = (.9 + .9)/2 = .9$, and the
initial overall reward probability (with all weights zero) is $(.9 + .1 + .1 + .9)/4 = .5$.
Note that if the network fails to discriminate by responding identically to all input
patterns, the overall reward probability is $(.9 + .1)/2 = .5$.
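
The three reward figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the averaging in the text implies, that the two stimulus values are equally likely, and the helper name `expected_reward` is hypothetical.

```python
# Reward probabilities d(x, y) as stated in the text for the two-unit task.
d = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}


def expected_reward(policy):
    """Overall reward probability for a network that outputs 1 with
    probability policy[x] when the stimulus is x (stimuli equally likely)."""
    total = 0.0
    for x in (0, 1):
        p_fire = policy[x]
        total += 0.5 * ((1.0 - p_fire) * d[(x, 0)] + p_fire * d[(x, 1)])
    return total


m_max = expected_reward({0: 0.0, 1: 1.0})      # perfect discrimination
m_initial = expected_reward({0: 0.5, 1: 0.5})  # all weights zero: fire with probability .5
m_never = expected_reward({0: 0.0, 1: 0.0})    # same response to every stimulus
m_always = expected_reward({0: 1.0, 1: 1.0})   # same response to every stimulus
```

Both non-discriminating policies give the same value as the untrained network, so any reward rate above .5 requires the two units to pass the stimulus information along.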

There are two ways the network can solve this problem. Let us denote the weights
associated with $u_i$’s (nonconstant) input pathway $w^i$, $i = 1, 2$. In the first solution,
$u_1$ learns to fire only when stimulus $x = 1$ is present by setting its threshold high
(i.e., setting its threshold weight negative) and setting $w^1$ positive. Unit $u_2$ does
the same thing: it sets its threshold high and $w^2$ positive so that it fires only when
stimulated by $u_1$’s firing. Consequently, the network as a whole fires only when $x = 1$.
In the second solution, $u_1$ learns to fire at all times except when stimulus $x = 1$ is
present, and $u_2$ learns to fire at all times except when $u_1$ fires. Then when $u_1$ is silent
in response to $x = 1$, $u_2$ is disinhibited and so fires.
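
The two solutions can be checked with deterministic threshold units, which is the limiting behavior the stochastic units approach once learning is complete. The specific weight values below are illustrative assumptions chosen only to have the signs described in the text; the function names are hypothetical.

```python
def fires(weight, threshold_weight, signal):
    """Deterministic linear threshold unit: output 1 when the weighted input
    plus the threshold weight (on the constant input 1) is positive."""
    return 1 if weight * signal + threshold_weight > 0 else 0


def network_output(solution, x):
    """Chain of two units: u1 sees the stimulus x, u2 sees only u1's output."""
    (w1, b1), (w2, b2) = solution
    y1 = fires(w1, b1, x)
    return fires(w2, b2, y1)


# First solution: both input weights positive, both thresholds high.
first = ((1.0, -0.5), (1.0, -0.5))
# Second solution: both input weights negative, both thresholds low.
second = ((-1.0, 0.5), (-1.0, 0.5))

first_outputs = [network_output(first, x) for x in (0, 1)]
second_outputs = [network_output(second, x) for x in (0, 1)]
```

Both weight settings produce the same input-output mapping even though the hidden unit's behavior is inverted between them.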

In simulating a trial with this network, and with all the networks to be described,
the environment first presents a stimulus pattern to the network, and then, proceeding
from the input side of the network, we sequentially compute the output of the
successive  units  so  that  their  actions  are  available  as  input  to  “downstream”  units. 
This  is  possible  because  the  networks  described  here  do  not  have  recurrent  con¬ 
nections.  When  the  network’s  overt  action  is  generated,  the  environment  produces 
the  reinforcement  signal,  and  all  the  units  update  their  weights.  We  view  the  weight 
modifications  as  occurring  simultaneously  for  all  units,  although  this  is  actually  done 
sequentially  by  the  computer  program. 
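
The trial procedure just described can be sketched for the two-unit network as follows. This is not the original simulation program. Equation 2.4 is not reproduced in this section, so the update below assumes a commonly stated form of the rule for 0/1-valued units (step toward the action taken on reward, and a smaller step scaled by `lam` toward the opposite action on penalty) together with a logistic firing probability at unit temperature; all function names are hypothetical.

```python
import math
import random

D = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}  # d(x, y) from the text


def fire_prob(weights, inputs):
    """Assumed logistic firing probability (temperature 1) of a stochastic unit."""
    s = sum(w * v for w, v in zip(weights, inputs))
    s = max(-60.0, min(60.0, s))  # clamp to keep exp() well inside float range
    return 1.0 / (1.0 + math.exp(-s))


def arp_update(weights, inputs, y, p, rewarded, rho, lam):
    """Assumed A_R-P step: toward y on reward, weakly toward 1 - y on penalty."""
    target = y if rewarded else 1 - y
    rate = rho if rewarded else lam * rho
    return [w + rate * (target - p) * v for w, v in zip(weights, inputs)]


def run(trials=500, rho=1.5, lam=0.04, seed=0):
    rng = random.Random(seed)
    w1 = [0.0, 0.0]  # [threshold weight, input weight] of the hidden unit u1
    w2 = [0.0, 0.0]  # same layout for the visible unit u2
    rewards = []
    for _ in range(trials):
        x = rng.randint(0, 1)                    # environment presents a stimulus
        in1 = [1.0, float(x)]
        p1 = fire_prob(w1, in1)
        y1 = 1 if rng.random() < p1 else 0       # upstream unit is computed first
        in2 = [1.0, float(y1)]
        p2 = fire_prob(w2, in2)
        y2 = 1 if rng.random() < p2 else 0       # overt action of the network
        rewarded = rng.random() < D[(x, y2)]     # one signal broadcast to both units
        # Both updates use quantities computed before either weight vector
        # changes, so doing them one after the other is equivalent to
        # updating simultaneously.
        w1 = arp_update(w1, in1, y1, p1, rewarded, rho, lam)
        w2 = arp_update(w2, in2, y2, p2, rewarded, rho, lam)
        rewards.append(1 if rewarded else 0)
    return w1, w2, rewards


w1_final, w2_final, reward_history = run()
```

Because every update depends only on values computed during the forward pass, the program's sequential order of updates does not matter, which is the sense in which the modifications can be regarded as simultaneous.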

Figure 6 shows the behavior of the network for a typical sequence of 500 trials
with $\lambda = .04$ and $\rho = 1.5$. Figure 6a shows the evolution of the behavior of $u_1$ in
terms of two graphs. The first shows the conditional probability that $u_1$ fires ($y_1 = 1$)
given that its (nonconstant) input is 0, and the second shows the same thing for input
1. Both of these probabilities start at .5 since the weights are initially zero, and they
change in approximately the same way for about the first 50 trials. This means
that during these trials the unit is experimenting with firing and not firing in the
presence of both input signals. At this point the two conditional probabilities show
the beginning of differentiation between the two cases, which becomes unequivocal by
about trial 80. From then on, with a few brief exceptions, $u_1$ has a high probability
of firing in response to an input of 1 and a low probability of firing in response to
an input of 0. Figure 6b shows the evolution of the mapping implemented by $u_1$
and $u_2$ acting together by showing the probability that $u_2$ fires ($y_2 = 1$) for the
different values of the network input $x$ (not for the values of $u_2$’s local input). Since
the network learns to respond correctly, $u_2$ learns to remain silent unless excited by
$u_1$’s activity; that is, the first solution is formed in which both $w^1$ and $w^2$ become
positive and both units set high thresholds. Figure 6c shows the evolution of the
overall performance measure $M_t$. Figure 6d is a histogram of the number of trials
required to reach a criterion of 98% of $M_{max}$ for each of 100 sequences of trials. In
all sequences the network reached this criterion before 1500 trials. In 45% of the
sequences, the network produced the first solution; in the remainder it produced the
second.

A series of two units in a discrimination task provides one of the simplest examples
we could devise to demonstrate statistical cooperativity of self-interested units. It is
clear that the $A_{R-P}$ units effectively form a link that permits them to obtain higher
reward  rates  than  they  could  attain  if  they  were  to  act  independently.  Moreover, 
a  unit  contributes  to  the  formation  of  this  link  only  because  doing  so  furthers  its 
interests.  We  interpret  this  as  a  form  of  cooperativity  in  the  literal  game-theoretic 
sense.  One  may  regard  the  link  as  a  “binding  agreement”  by  which  the  units  form 
a  coalition  for  mutual  benefit.  We  have  simulated  series  of  3,  4,  and  5  units  with 
appropriate  connections  being  made  in  all  cases,  although  learning  slows  considerably 
as  the  depth  of  the  network  increases.  Although  the  discrimination  required  in 
these  tasks  is  not  difficult,  the  necessity  to  construct  a  long  chain  of  connections 
that  faithfully  transmits  the  discriminative  stimulus  is  quite  difficult.  The  correct 
behavior  for  any  unit  depends  on  the  behavior  implemented  by  all  the  other  units  so 
that  the  solution  cannot  be  constructed  from  stable  solutions  to  subtasks. 

The  XOR  Task 

In the task just described, cooperative learning is required only because the network
lacks a direct pathway from input to output. The task itself is easily within
the capabilities of a single unit. Here we illustrate the simplest example of a task
that cannot be solved by a single linear threshold unit, or any single-layer network of
them. In this problem the hidden unit is needed not just to transmit a discriminative
stimulus to the visible unit; the hidden unit must learn to respond to particular
configurations of its stimulus components in order to create a signal that the visible unit
needs to behave properly. In our simulation, a network of two $A_{R-P}$ units is placed
in a task requiring it to form the two-component exclusive-or mapping. The network
has a single hidden unit, $u_1$, and a single visible unit, $u_2$, which are connected as
shown in Fig. 7. The stimulus patterns are all the two-component binary vectors:
$x^{(0)} = (0,0)$, $x^{(1)} = (0,1)$, $x^{(2)} = (1,0)$, $x^{(3)} = (1,1)$. These patterns are equally likely
to occur on any trial. Each unit also has a constant input and a threshold weight.


The reward probabilities are given by the following table:

x        d(x,0)   d(x,1)
(0,0)    .9       .1
(0,1)    .1       .9
(1,0)    .1       .9
(1,1)    .9       .1

Table entry $d(x, y)$ is the reward probability given that the network receives $x$ as
input and responds with action $y$. The optimal reward probability is $M_{max} = .9$,
which is obtained when the action of the visible unit is the exclusive-or of the pattern
components, that is, when $u_2$ fires when one or the other, but not both, of the stimulus
components is present. It must also not fire when both components are absent. A
single $A_{R-P}$ unit can be correct for at most three of the four cases, yielding a reward
probability of .7, since weights do not exist that allow a single linear threshold unit
to respond correctly to all four stimuli (see Duda and Hart, 1973, or Minsky and
Papert, 1969). However, the performance of the network of Fig. 7 can approach
$M_{max}$ if the hidden unit learns to respond only to the fourth case and the visible unit
takes advantage of this signal to “debug” its responding. This can happen in several
ways depending on whether the hidden unit learns to turn on or off for the fourth
case.
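
The figures .7 and .9 can be reproduced with deterministic threshold units. The sketch below is illustrative only: the coarse weight grid stands in for a proof (a grid search cannot by itself show that no single unit solves the task), the hand-set weights are one of the several possible solutions, and the wiring of the two-unit network (visible unit receiving both stimulus components plus the hidden unit's output) is an assumption about Fig. 7.

```python
from itertools import product

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# D[x] = (d(x, 0), d(x, 1)) from the reward table in the text.
D = {(0, 0): (0.9, 0.1), (0, 1): (0.1, 0.9), (1, 0): (0.1, 0.9), (1, 1): (0.9, 0.1)}


def threshold(s):
    return 1 if s > 0 else 0


def reward_prob(respond):
    """Overall reward probability of a deterministic response function."""
    return sum(D[x][respond(x)] for x in PATTERNS) / len(PATTERNS)


def single_unit(a, b, c):
    """One linear threshold unit with input weights a, b and threshold weight c."""
    return lambda x: threshold(a * x[0] + b * x[1] + c)


# Best single unit found on a coarse grid of weights in [-3, 3].
grid = [v / 2 for v in range(-6, 7)]
best_single = max(reward_prob(single_unit(a, b, c)) for a, b, c in product(grid, repeat=3))


def two_unit(x):
    """Hidden unit fires only for (1,1); visible unit is an OR inhibited by it."""
    hidden = threshold(x[0] + x[1] - 1.5)
    return threshold(x[0] + x[1] - 2 * hidden - 0.5)


two_unit_reward = reward_prob(two_unit)
two_unit_outputs = [two_unit(x) for x in PATTERNS]
```

Here the hidden unit turns on for the fourth pattern; a mirror-image solution in which it turns off only for that pattern works equally well with the sign of the interconnecting weight reversed.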


Figure 7: Network for the Exclusive-Or Task.

Figure 8 shows performance of the two-unit network for a typical sequence of 5000
trials with $\rho = 1.5$ and $\lambda = .08$. In Fig. 8a are graphs showing how the output probabilities
of the visible unit develop for each input pattern; Fig. 8b shows the analogous
information for the hidden unit; and Fig. 8c shows the overall performance of the
network as a function of the trial number. The visible unit quickly learns to respond
correctly to all patterns except $x^{(1)} = (0,1)$ (Fig. 8a), causing the network performance
to level off near .7 (Fig. 8c). Eventually ($t \approx 1400$) the hidden unit comes to
respond reliably to $x^{(1)}$ and to reliably not respond to any other pattern (Fig. 8b).
At the same time, the visible unit begins to be excited by the hidden unit’s signal
so that its output tends to be correct more frequently for all four patterns (Fig. 8a).
Once this mutually beneficial relationship between $u_1$ and $u_2$ begins, it quickly develops
until almost perfect performance is achieved (the theoretical asymptote is .892
for this value of $\lambda$). It is clear that this is a cooperative process.

Figure 9 shows a histogram of the number of trials until a criterion of 95% of
$M_{max}$ is attained for each of 100 sequences of trials. The average number of trials
until  criterion  is  3501,  or  about  875  trials  for  each  stimulus  pattern.  In  all  of  the 
sequences  the  network  reached  this  criterion  before  15,000  trials. 

The  Multiplexer  Task 

Figure 9: Histogram of Trials to Criterion for 100 Sequences of Trials in the
Exclusive-Or Task.

Figure 10: Layered Network for the Multiplexer Problem.

The network shown in Fig. 10 has six input pathways and a single principal output
pathway (from unit 5). There are 39 weights to adjust: one associated with each
of the pathway intersections and one threshold weight for each unit. The reward
contingencies implemented by the network’s environment force the network to learn
to realize a multiplexer circuit in order to obtain optimal performance. A multiplexer
is a device with $k$ address input pathways and $2^k$ data input pathways (here $k = 2$),
each of which is associated with a distinct $k$-bit address. Given a pattern over the
address pathways, i.e., an address, a multiplexer’s output is equal to whatever signal
(0 or 1) appears on the data pathway associated with that address. It therefore
routes signals from different input pathways to a single output pathway depending
on the “context” provided by the pattern over the address pathways. If we call the
address components $a_1$ and $a_2$ and the data components $d_1, \ldots, d_4$, a minimal
logical expression for the multiplexer function is

$$\bar{a}_1 \bar{a}_2 d_1 \vee \bar{a}_1 a_2 d_2 \vee a_1 \bar{a}_2 d_3 \vee a_1 a_2 d_4.$$

There are a total of $2^6$, or 64, input patterns.
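
The target function can be written out directly and checked over all 64 patterns. This is a sketch for illustration: the assignment of addresses to data pathways (00 to the first, 01 to the second, 10 to the third, 11 to the fourth) is an assumption, and the function names are hypothetical.

```python
from itertools import product


def multiplexer(a1, a2, d1, d2, d3, d4):
    """Output whichever data signal is selected by the 2-bit address (a1, a2)."""
    data = (d1, d2, d3, d4)
    return data[2 * a1 + a2]


def logical_expression(a1, a2, d1, d2, d3, d4):
    """The same function as a four-term disjunction of conjunctions."""
    n1, n2 = 1 - a1, 1 - a2  # negated address components
    return (n1 & n2 & d1) | (n1 & a2 & d2) | (a1 & n2 & d3) | (a1 & a2 & d4)


patterns = list(product((0, 1), repeat=6))  # all (a1, a2, d1, d2, d3, d4)
forms_agree = all(multiplexer(*p) == logical_expression(*p) for p in patterns)
```

Each address selects a different data component, so no fixed weighting of the six inputs by a single threshold unit can reproduce the function.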

For each of the 64 possible input patterns, we rewarded each unit of the network
with probability 1 if the visible unit (unit 5) produced the correct output, and we
penalized each unit with probability 1 otherwise. The input patterns were chosen
randomly for presentation to the net. All of the units implement the $A_{R-P}$ algorithm
with $T = .5$ except for the visible unit (unit 5), which uses $T = 0$ (and therefore
essentially uses the perceptron algorithm; see Section 2). Fig. 11 is a histogram
of the number of trials required for the network to respond 99% correctly for 1000
consecutive trials in each of 30 sequences of trials with $\rho = 1$ and $\lambda = .01$. The
average number of trials required is 133,149, or about 2080 presentations of each
stimulus pattern. In every sequence the network reached the criterion before 350,000
trials.

Figure 11: Histogram of Trials to Criterion for the Multiplexer Task.

This task illustrates some of the computational sophistication that can arise with
the formation of nonlinear functions. Linear threshold functions can exhibit only a
very restricted form of context sensitivity: contextual information can bias activation
one way or the other, effectively raising or lowering a threshold. Nonlinear context
sensitivity, on the other hand, can result in the complete alteration of behavior as
a function of contextual information. The exclusive-or task described in Section 3
illustrates this in the simplest form, where one stimulus component can be regarded
as switching the processing of the second stimulus component between the identity
and inversion functions. The multiplexer illustrates a more extreme form by which
the contextual information provided over the address pathways completely alters the
set of signals to which the principal unit is sensitive.

Discussion: $A_{R-P}$ Networks and Gradient Descent

Not long after we began experimenting with networks of $A_{R-P}$ units, Rumelhart,
Hinton, and Williams [44] presented an error back-propagation method for learning
in layered networks that has since become well-known. This method is described
in Section 4, where its performance is compared with that of several other methods
including the $A_{R-P}$ method. This error back-propagation method is now deservedly
popular since it is simple to understand and outperforms other methods for learning
in layered networks, including ours based on $A_{R-P}$ units. What is most interesting
here is that the error back-propagation method together with a theoretical result of
Williams [61,62] sheds much light on the collective behavior of $A_{R-P}$ units in layered
networks. In fact, it is not too misleading to regard $A_{R-P}$ networks as performing a
kind of stochastic approximation to the back-propagation method (although this is
not strictly true for several reasons to be discussed).

The error back-propagation method is a gradient descent procedure in weight
space. The remarkable result is that information about how to step in weight
space to minimize (or maximize) a global network performance criterion can be
obtained locally in the networks. In the case of the back-propagation algorithm,
this information (the partial derivative of the performance criterion with respect
to each weight) is obtained through a complex process in which error signals are
transformed and passed backward through the network. Another way for a unit to
determine what steps to take in weight space is for it to determine the derivative
of the performance criterion with respect to its activity by varying its output and
observing how the global performance changes as a result. Given this estimate, the
unit can then correctly determine how to change its weights.


More specifically, suppose the units are deterministic, and that a given hidden
unit  can  vary  its  output  around  its  current  value  while  the  outputs  of  all  of  the 
other  units  are  frozen  at  their  current  values.  By  observing  the  consequences  of  this 
variation  on  the  performance  criterion,  the  unit  can  determine  the  gradient  of  the 
criterion  with  respect  to  its  output  at  the  current  point  in  weight  space.  From  this 
it  can  easily  determine  the  criterion’s  derivative  with  respect  to  its  weights,  and  so 
can  alter  them  appropriately.  Now  each  unit  in  turn  can  do  this  with  the  other  units 
frozen.  If  a  unit’s  new  weights  are  not  put  into  place  until  all  the  units  have  varied 
their  outputs,  the  result  will  be  a  step  in  weight  space  according  to  the  gradient  of  the 
criterion.  This  process,  which  is  reminiscent  of,  but  different  from,  the  Boltzmann 
relaxation  process,  would  work  but  has  obvious  shortcomings  since  some  outside 
agency  would  have  to  orchestrate  the  process  and  it  would  be  quite  slow. 
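
The one-unit-at-a-time procedure can be illustrated on a chain of two deterministic units. The sketch below makes several assumptions that are not in the text: the units are logistic (so that the output can be varied smoothly), the criterion is a negative squared error against a hypothetical target, and the variation is a small symmetric perturbation. It compares the gradient obtained by perturbing the hidden unit's output with the gradient obtained by the chain rule, which is what back-propagation would compute.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def criterion(y, target):
    """Hypothetical global performance criterion (larger is better)."""
    return -(y - target) ** 2


def output_given_hidden(h, w2, b2):
    """Visible unit's output with the hidden unit's output held at h."""
    return sigmoid(w2 * h + b2)


def perturbation_gradient(x, target, w1, b1, w2, b2, eps=1e-5):
    """Derivative of the criterion w.r.t. w1, obtained by varying the hidden
    unit's output around its current value with everything else frozen."""
    h = sigmoid(w1 * x + b1)
    j_up = criterion(output_given_hidden(h + eps, w2, b2), target)
    j_down = criterion(output_given_hidden(h - eps, w2, b2), target)
    dj_dh = (j_up - j_down) / (2 * eps)   # observed effect of the variation
    return dj_dh * h * (1 - h) * x        # local step from output to weight


def chain_rule_gradient(x, target, w1, b1, w2, b2):
    """The same derivative computed analytically, as back-propagation would."""
    h = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * h + b2)
    dj_dy = -2 * (y - target)
    return dj_dy * y * (1 - y) * w2 * h * (1 - h) * x


args = (1.0, 1.0, 0.3, -0.2, 0.8, 0.1)  # x, target, w1, b1, w2, b2
g_perturb = perturbation_gradient(*args)
g_chain = chain_rule_gradient(*args)
```

The hidden unit needs only its own input, its own output, and the observed change in the global criterion; it never needs to know the visible unit's weights.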

But  can  the  units  vary  their  outputs  simultaneously  and  observe  the  consequences 
to  achieve  the  same  result?  This  could  be  made  to  work  if  the  units  independently 
influenced  the  criterion  function,  but  it  is  difficult  to  see  how  it  could  be  done  if  these 
influences  are  not  independent,  which  is  the  only  case  of  real  interest.  It  turns  out, 
however,  that  it  is  possible  for  interacting  units  to  simultaneously  vary  their  outputs 
to  obtain  an  estimate  of  the  appropriate  gradient.  This  is  essentially  what  happens 
in networks of A_{R-P} units. Williams has shown [61] for an arbitrary acyclic network
of A_{R-P} units that if the parameter λ in Equation 2.4 is zero for each unit², then
the expected direction of each weight change is proportional to the gradient of the
global network reward probability. Consequently, each weight changes according to
an unbiased estimate of the partial derivative of the global criterion function with
respect to that weight. On any particular trial, the step in weight space actually
taken may or may not amount to an improvement, but the trend will always be in
the correct direction.
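The gradient property can be checked exactly for the simplest case of a single unit. The sketch below is an illustration only. Equation 2.4 is not reproduced in this part of the report, so the update is assumed to take the commonly stated form: with binary output a and p = Pr(a = 1) given by a logistic function of the weighted input sum, the weight change is ρ(a − p)x on reward and λρ(1 − a − p)x on penalty. The names `expected_update`, `reward_gradient`, and `reward_prob` are hypothetical.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def expected_update(w, x, reward_prob, rho=1.0, lam=0.0):
    """Exact expectation of the assumed A_R-P weight change for one unit.

    reward_prob[a] is the probability of reward given binary output a.
    """
    p = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    expected = [0.0] * len(w)
    for a, prob_a in ((1, p), (0, 1.0 - p)):
        r = reward_prob[a]
        # Reward moves p toward a; penalty moves p toward 1 - a, scaled by lam.
        factor = r * (a - p) + (1.0 - r) * lam * ((1 - a) - p)
        for i, xi in enumerate(x):
            expected[i] += prob_a * rho * factor * xi
    return expected

def reward_gradient(w, x, reward_prob):
    """Gradient of the overall reward probability with respect to w."""
    p = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    slope = (reward_prob[1] - reward_prob[0]) * p * (1.0 - p)
    return [slope * xi for xi in x]
```

With `lam=0.0` the two functions agree up to the factor `rho`. With a nonzero `lam` an extra penalty term appears, so the expected step is no longer exactly along the gradient. The network-level result attributed to Williams is more general than this single-unit check.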

Thus, A_{R-P} networks (with λ = 0) provide a way of locally computing gradient
information without the need for a complex back-propagation process. We have
found that in practice such networks actually require λ to be nonzero (in fact, a
positive value much smaller than ρ) in order to converge properly. Although the
average direction of weight change is correct when λ = 0, the process can get stuck
at suboptimal points in weight space because the units become deterministic too
soon. Setting λ nonzero seems to prevent this from happening by eliminating all
absorbing states from the stochastic process. Consequently, even after learning is
complete, all the units retain a small amount (depending on the size of λ) of random
variability in their behavior.

²We call units with λ = 0 A_{R-I} units, for Associative Reward-Inaction units: upon penalty, no
weight changes occur.

This view of A_{R-P} networks provides a link, albeit an approximate one, to the
gradient descent procedure implemented by the Rumelhart et al. back-propagation
method. The link is not exact for two reasons: 1) since A_{R-P} units are binary
whereas back-propagation units have continuous outputs, the activity spaces in the
two cases are different, and 2) the criterion functions in the two cases are different: in
the A_{R-P} case it is the network reward probability whereas in the back-propagation
case it is the total mean-square-error of the visible unit’s activity. Nevertheless, the
relationship between these two methods is useful in understanding the cooperative
interaction that occurs in A_{R-P} networks. As one would expect from this relationship,
the A_{R-P} method is slower than the back-propagation method in terms of the
number of stimulus pattern presentations. This is borne out in the comparative
studies described in the next section. However, the A_{R-P} method does not require a
back-propagation process to assign credit to the units. This could have advantages
in terms of hardware implementation and in terms of biological plausibility. The
relationship of the A_{R-P} method to gradient descent also suggests a modification of
the A_{R-P} learning scheme, which we call the batched A_{R-P} method, that is described
in Section 4.


Comparative  Studies  of  Layered  Network  Learning  Methods 


In  assessing  any  new  approach  to  an  old  problem,  it  is  necessary  to  compare 
the  new  method  with  ones  that  have  been  tried  before.  We  therefore  conducted 
simulation  studies  designed  to  compare  a  number  of  methods  that  have  been  proposed 
for  learning  in  layered  networks.  We  compared  eleven  such  methods  by  applying  each 
to  the  same  learning  task.  We  chose  the  multiplexer  task  (see  Section  3)  because  it 
is  difficult  enough  to  show  the  advantages  of  the  more  sophisticated  methods,  but  it 
is  simple  enough  that  reasonable  amounts  of  CPU  time  are  required  for  statistically 
significant  comparisons.  In  this  section  I  review  the  results  obtained.  Complete 
details  are  available  in  Ref.  [4]  from  which  this  section  is  abstracted. 

In the experiments to be described, the hidden-unit learning rule is the primary
variable. The learning rule for the output unit is the same for most experiments.
The perceptron learning rule [42] is used for the output unit since it is well-known
and is relatively insensitive to the learning rate parameter ρ (so ρ would not have to
be varied to optimize performance).¹ The network structure is as in Fig. 10. A step
in the simulation of this system consists of the following. An input vector is selected
by choosing one randomly, without replacement, from the set of all input vectors.
Upon receipt of an input vector, the outputs of the hidden units are calculated,
followed by the calculation of the output of the output unit. The output of the
output unit is subtracted from the desired output. This error controls the perceptron
learning rule as it is applied to the weights of the output unit, after which the
particular learning method being tested in the hidden units is applied to the hidden
units’ weights (although some methods, such as the direct-search methods, do not
change the weights of the hidden units on every step). This completes one step in
the simulation. Every input vector is presented once during the first 64 steps, and
once again for every subsequent set of 64 steps, where the order of presentation is
determined randomly.

¹The application of the error back-propagation method [44] to the hidden units requires the use of
a differentiable output function in the output unit, so a semilinear output function and learning rule
were used in the output unit for the experiments with the error back-propagation method.
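The presentation schedule (every input vector once per block of 64 steps, in a fresh random order for each block) can be sketched as below. This is an illustration, not the simulation code used in the study. The function names are hypothetical, and the bit layout of the multiplexer task (two address bits followed by four data bits) is an assumption, since the task is defined in Section 3 rather than here.

```python
import itertools
import random

def multiplexer_patterns():
    """All 64 six-bit inputs with a multiplexer target (assumed bit layout)."""
    patterns = []
    for bits in itertools.product((0, 1), repeat=6):
        address = 2 * bits[0] + bits[1]      # first two bits select a data line
        patterns.append((bits, bits[2 + address]))
    return patterns

def presentation_order(patterns, steps, rng):
    """Yield patterns so that each appears once per block of len(patterns) steps."""
    block = []
    for _ in range(steps):
        if not block:
            # Start a new block: sampling without replacement from a
            # freshly shuffled copy of the full pattern set.
            block = list(patterns)
            rng.shuffle(block)
        yield block.pop()
```

For example, `presentation_order(multiplexer_patterns(), 300000, random.Random(0))` produces the input sequence for one run of 300,000 steps.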

The direct-search methods are presented first. These methods require no knowledge
about the network other than the number of hidden-unit weights and their
ranges of values. Following the direct-search methods, several error back-propagation
methods are presented that involve the propagation of the output unit’s error to the
hidden units. Several reinforcement-learning methods are then presented, including
the A_{R-P} method. A modification of one reinforcement-learning rule is considered
that generates localized reinforcements to the hidden units by propagating information
from the output unit back to the hidden units. Finally, a mechanism is considered
that treats hidden units that have not yet acquired a substantial influence on
the output unit differently from those that have developed influence.

The  behavior  of  each  method  depends  on  several  parameters.  A  comparative 
study  should  guarantee  that  parameter  values  are  used  that  are  optimal  for  a  given 
method  to  ensure  the  absence  of  bias  in  favor  of  one  method  over  another.  However, 
the  time  required  to  simulate  the  learning  process  in  these  experiments  prohibited  a 
thorough  optimization  of  the  parameter  values.  We  were  able  to  test  an  average  of  six 
different  values  over  a  broad  range  for  each  parameter,  and  when  a  method  depends 
on  more  than  one  parameter,  only  one  parameter  was  varied  at  a  time.  Note  that 
this  attempt  to  compare  methods,  where  each  is  operating  with  optimal  parameter 
values, does not address the important issue of the relative degree of robustness of
the methods. Since the parameter values that are optimal depend on the learning
task, it is possible that a learning method may excel at a particular task when using
specific parameter values and yet perform badly on another task when using those
same parameter values. On the other hand, a method that learns more slowly than
other methods on a specific task may have a speed advantage over the other methods
when applied, with the same parameter settings, to a class of tasks. The comparative
studies reported here do not address this important issue.

Another important issue that is not addressed by these studies is the issue of
scale-up. How do learning times grow as tasks get larger or more difficult? We did
not apply the battery of learning methods to a series of increasingly difficult learning
tasks.

Direct-Search  Methods 
Unguided  Random  Search 

The  simplest  possible  brute-force  random  search  was  included  to  provide  some 
idea  of  how  difficult  the  test  learning  task  is.  This  method  consists  of  randomly 
choosing new weight values for all of the hidden units in the network (using a uniform
probability density function); evaluating these weights by allowing the network to
interact  with  its  environment  for  a  number  of  steps  (denoted  n);  and  remembering 
after  each  evaluation  period  the  weight  values  receiving  the  best  evaluation  so  far. 
Here  we  want  to  evaluate  the  current  values  of  the  weights  by  measuring  how  well  the 
network  can  solve  the  task  using  the  given  weight  values.  The  output  unit  continues 
to  learn  while  the  weights  of  the  hidden  units  are  held  constant.  The  weights  of  the 
output  unit  are  set  to  zero  whenever  new  values  for  the  hidden  units’  weights  are 
generated. 
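The search loop itself is very small. The sketch below is illustrative only. The callback `evaluate` is a hypothetical stand-in for the n-step evaluation period described above (reset the output unit's weights to zero, run the network for n steps with the hidden weights held constant, and return the resulting error measure), and the names `low`, `high`, and `n_candidates` are assumptions.

```python
import random

def unguided_random_search(evaluate, n_weights, low, high, n_candidates, rng):
    """Keep the best of several independently drawn weight vectors.

    evaluate(weights) returns an error measure (lower is better).
    """
    best_w, best_err = None, float("inf")
    for _ in range(n_candidates):
        # Each candidate is drawn afresh from a uniform density and does
        # not depend on any previously tested vector.
        w = [rng.uniform(low, high) for _ in range(n_weights)]
        err = evaluate(w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

In the experiments a run of 300,000 steps with evaluation periods of n steps corresponds to 300,000 / n candidates.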

The unguided random search was tested on the multiplexer task for several values
of n. For each value of n, the results from 10 runs of 300,000 steps were collected.
The final performance level of a run, ν, is the number of input vectors for which
the network is incorrect when using the best set of weights found on that run (so
0 ≤ ν ≤ 64 and a purely random strategy of generating outputs would result in an
average value of 32).

In addition to the performance level at the final step of each run, we determined
the value of a measure of cumulative performance, μ, which for a single run is the sum
of the number of errors made on every step. For a nonlearning, random strategy,
errors would occur on an average of half of the steps, producing a value for μ of
150,000 over a run of 300,000 steps.


The results of the experiments are listed in Table 1, including the 99% confidence
intervals of ν and μ. The unguided random search performed better than a nonlearning,
random strategy for all values of n that were tried. The value of μ consistently
declines as the parameter n increases. Recall that after every n steps, a new weight
vector is generated that does not depend on previously-tested vectors, so there is no
gradual improvement in performance as a run progresses. However, since the output
unit is learning throughout each n step period, larger values of n result in better
performance at the end of the n step period and better average performance over
that period, which explains the inverse relationship of μ and n.

Table  1:  Unguided  Random  Search  on  the  Multiplexer  Task 


      n          ν               μ
     50     25.6 ± 2.78    140,228 ± 263
    100     22.7 ± 2.88    134,397 ± 128
    200     23.4 ± 3.80    127,913 ± 237
    400     18.0 ± 3.21    122,209 ± 176
    800     16.5 ± 2.66    117,899 ± 342
   1600     15.7 ± 3.22    115,099 ± 324
   3200     18.4 ± 5.23    112,848 ± 462
   6400     17.0 ± 4.29    112,577 ± 803
  12800     16.9 ± 3.96    111,477 ± 1,064
  25600     17.7 ± 3.71    110,654 ± 1,535

The values of ν do not show that any one value of n is optimal. When n is 200
or less, significantly higher values of ν are obtained than when n is 400 or greater.
In fact, for n ≤ 200, performance is not significantly different from that of a single
layer, for which ν ≈ 24.

A learning curve for the unguided random search on the multiplexer task was
obtained by choosing the best value of n, which is 1600, and performing 30 runs of
300,000 steps each. This resulted in performance measures of ν = 17.0 ± 2.93 and
μ = 115,062 ± 229 and the learning curve in Fig. 14 (the upper-most curve). On
this and all subsequent graphs, an initial rapid drop appears from 0.5 errors per step
to approximately 0.37 or 0.38. This is caused by the output unit learning as many
correct responses as possible without using hidden units; a single unit given the input
vectors for the multiplexer task can learn the correct output for 40 of the 64 input
vectors, resulting in an average of 0.375 errors per step.
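The two baseline figures used in this discussion follow directly from the numbers already given; the short calculation below is included only as a worked check.

```latex
% Single-unit plateau: 40 of 64 input vectors correct, so 24 are wrong.
\[
  \frac{64 - 40}{64} = \frac{24}{64} = 0.375 \ \text{errors per step}
\]
% Nonlearning random strategy: an error on half of the 300,000 steps.
\[
  0.5 \times 300{,}000 = 150{,}000 \ \text{cumulative errors}
\]
```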

Guided  Random  Search 

There are obviously many ways of improving the unguided random search, all of
which involve generating weight vectors that depend on the currently-best vector (or
on a series of best vectors). We studied two methods: a guided random search and
the polytope method described below. The guided random search differs from the
unguided random search only in the manner of generating new weight vectors. Rather
than being chosen according to a uniform probability density function, weight vectors
are chosen from a unimodal probability density function (defined below) centered on
the weight vector that is currently the best. This density function is symmetric
about the currently-best vector, and the probability of selecting vectors decreases as
the Euclidean distance from the currently-best vector increases. We used a density
function based on the logistic distribution (see Ref. [4] for details). The method
depends on two parameters: the number of steps between the generation of weight
vectors, n, and the spread of the density function, τ.
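One way to draw candidates with the stated properties is sketched below. It is an assumption-laden illustration, not the density defined in Ref. [4]: here the direction is uniform on the sphere and the distance from the current best vector is the absolute value of a logistic variate with scale `tau`. The function name `guided_candidate` is hypothetical.

```python
import math
import random

def guided_candidate(best, tau, rng):
    """Draw a weight vector from a symmetric density centred on `best`."""
    # A normalized Gaussian vector gives a direction uniform on the sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in best]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    u = rng.random()
    while u == 0.0:                      # avoid log(0)
        u = rng.random()
    # Inverse-CDF sample of a logistic variate with scale tau, folded to >= 0.
    radius = abs(tau * math.log(u / (1.0 - u)))
    return [b + radius * d / norm for b, d in zip(best, direction)]
```

Small values of `tau` keep candidates close to the current best vector, while large values spread them widely, which is the qualitative behavior the spread parameter is meant to control.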

As stated earlier, the amount of computer time required to perform these experiments
prevented a systematic search for the optimal values of n and τ. However,
we did perform two unidimensional searches by holding τ = 2 while varying n, then
varying τ while holding n at the value resulting in the best performance. For each
parameter setting, results were averaged over 10 runs with each run lasting 300,000
steps. The results in Table 2 show that intermediate values of n are required to
achieve good performance. However, unlike the results for the unguided random
search, the cumulative performance measure, μ, also has a U-shape as n increases,
providing evidence of a tradeoff between learning in the output unit (large n) and
optimizing the weights of the hidden units by making more trials (small n).

Performance as a function of τ also has a U-shape: there appears to be an optimal
value of τ in the range of 0.5 to 2 (as the value of τ increases, the probability density
function approaches a uniform density function, and the behavior of guided random
search approaches that of the unguided random search). The learning curve in Fig. 14
was produced by averaging 30 runs of 300,000 steps each, using n = 3200 and τ = 1.
The resulting performance levels are ν = 13.1 ± 2.36 and μ = 103,866 ± 3,420.


Table 2: Guided Random Search on the Multiplexer Task

      n          ν               μ                    τ          ν               μ
     50     27.2 ± 3.76    138,978 ± 898 [?]         0.1    18.9 ± 4.91    106,894 ± 6,204
    100     24.1 ± 3.61    131,957 ± 2,102           0.2    17.1 ± 2.53    109,454 ± 4,897
    200     18.4 ± 3.94    124,089 ± 1,345           0.5    14.9 ± 4.13    105,343 ± 6,124
    400     13.8 ± 3.76    115,390 ± 3,271           1.0    11.4 ± 3.58    102,583 ± 6,128
    800     13.3 ± 4.73    111,205 ± 3,851           2.0    12.5 ± 3.94    106,818 ± 2,524
   1600     13.1 ± 4.21    106,544 ± 3,314           4.0    15.6 ± 2.83    108,128 ± 2,797
   3200     12.5 ± 3.94    106,818 ± 2,524           8.0    15.0 ± 4.06    108,498 ± 3,066
   6400     16.5 ± 4.48    108,225 ± 2,413
  12800     17.6 ± 5.29    108,620 ± 3,615
                  τ = 2                                         n = 3200


The  Polytope  Algorithm 

Another method for directly searching the weight space is the Polytope Algorithm
[21]. This method is often called the “simplex” method, not to be confused with the
simplex method for linear programming. The polytope algorithm is a deterministic
hillclimbing method that maintains a list of m weight vectors, ordered according
to their evaluations. The m weight vectors are treated as vertices of a polytope in
(m − 1)-dimensional space, and new vectors are generated in a fashion designed to shift
the  polytope  towards  an  optimum  weight  vector,  taking  large  steps  when  progress 
is  being  made  in  improving  the  evaluation  and  taking  smaller  steps  when  it  appears 
that  the  optimum  has  been  approached.  Since  this  is  a  deterministic  hillclimbing 
method,  it  can  get  stuck  at  a  local  optimum,  but  it  is  good  at  following  ravines.  We 
included  it  in  our  study  as  an  example  of  a  reasonably  sophisticated,  deterministic, 
direct-search  algorithm  to  complement  the  random  methods  presented  above. 

The polytope algorithm depends on the parameter m, the number of weight
vectors maintained as vertices of the polytope, and the parameter n, the number
of steps over which each weight vector is evaluated. Other parameters are ρ_r, ρ_e,
and ρ_c, which determine the lengths of reflection, expansion, and contraction steps,
respectively. Valid ranges for these parameters are ρ_r > 0, ρ_e > 1, and 0 < ρ_c < 1.
To reduce the number of experiments to a practical level, we did not attempt to find
optimal values for ρ_r, ρ_e, and ρ_c, but set them to reasonable values. We did vary m
and n, as shown in Table 3. The value of m was fixed at 20 while n varied, after
which n was fixed at 1600, which gave the best value of ν, while m was varied. The
results are again averages over 10 runs at 300,000 steps per run.
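The roles of the reflection, expansion, and contraction parameters can be seen in the following sketch. It is an illustration in the style of the widely used Nelder-Mead procedure, not the code used in the study. The function name, the default coefficient values, and the final "shrink toward the best vertex" fallback are assumptions, and `f` stands in for the n-step evaluation of a weight vector (lower is better).

```python
def polytope_minimize(f, vertices, rho_r=1.0, rho_e=2.0, rho_c=0.5, iterations=200):
    """Minimize f by repeatedly replacing the worst vertex of a polytope."""
    pts = [list(v) for v in vertices]
    for _ in range(iterations):
        pts.sort(key=f)                        # best first, worst last
        best, worst = pts[0], pts[-1]
        centroid = [sum(c) / (len(pts) - 1) for c in zip(*pts[:-1])]

        def along(t):
            # Point on the line from the worst vertex through the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(rho_r)
        f_r = f(reflected)
        if f_r < f(best):                      # progress: try a longer step
            expanded = along(rho_r * rho_e)
            pts[-1] = expanded if f(expanded) < f_r else reflected
        elif f_r < f(pts[-2]):                 # better than the second worst
            pts[-1] = reflected
        else:                                  # near an optimum: shorter step
            t = rho_r * rho_c if f_r < f(worst) else -rho_c
            contracted = along(t)
            if f(contracted) < min(f_r, f(worst)):
                pts[-1] = contracted
            else:                              # shrink toward the best vertex
                pts = [best] + [[(b + p) / 2.0 for b, p in zip(best, v)]
                                for v in pts[1:]]
    return min(pts, key=f)
```

Because every decision depends only on the ordering of the evaluations, the method is deterministic for a deterministic `f`; in the experiments the evaluations are themselves noisy, which is one reason longer evaluation periods can help.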

Table 3 suggests that the optimum value of n is between 400 and 3,200. The
results are even less conclusive about the optimum value of m; additional runs must
be made to obtain performance averages with less variance. The values n = 1600
and m = 10 were used in 30 runs of 300,000 steps to obtain the learning curve in
Fig. 14, resulting in ν = 14.2 ± 2.09 and μ = 94,977 ± 3,079.


Table  3:  Polytope  Algorithm  on  the  Multiplexer  Task 


      n          ν               μ
     200    20.8 ± 4.04    118,780 ± 6,537
     400    17.8 ± 4.33    105,024 ± 4,442
     800    13.0 ± 3.82     99,575 ± 5,319
   1,600    12.0 ± 2.70    102,449 ± 3,054
   3,200    14.2 ± 2.70    109,711 ± 1,460
   6,400    15.7 ± 3.74    110,800 ± 2,058
  12,800    19.0 ± 3.93    110,860 ± 2,488
                 m = 20

      m          ν               μ
      3     17.7 ± 7.78    100,040 ± 0,707 [?]
      5     17.0 ± 4.51     90,223 ± 4,441
     10     12.1 ± 1.97     94,157 ± 4,105
     15     15.9 ± 0.15 [?] 102,793 ± 4,071
     20     12.6 ± 2.70    102,449 ± 3,654
     25     14.7 ± 4.24    107,972 ± 2,447
                 n = 1600


None of the direct-search methods were able to solve the multiplexer task within
the allotted 300,000 steps. The unguided random search showed no improvement
over time because the weight vectors being tested were not dependent on previous
search steps. Its final performance level is slightly better than that of the single-layer
system (ν ≈ 17 versus ν ≈ 24). The guided random search does show improvement
over time, though its learning curve becomes approximately flat early in the runs.
Averaged over the last 3,000 steps of every run, the number of errors per step is
approximately 0.35. The polytope algorithm performs better than both random
search methods, reaching an average over the last 3,000 steps of 0.28 errors per step.

Error  Back-Propagation  Methods 

Next  we  discuss  some  error  back-propagation  methods  for  learning  in  hidden 
units,  starting  with  a  method  studied  by  Rosenblatt  [42]. 

Rosenblatt’s  Back-Propagation  Method 

Rosenblatt is known for his work with the perceptron family of learning rules,
but  his  error  back-propagation  method  has  received  little  attention.  Since  this  was 
proposed  early  in  the  history  of  research  on  learning  in  multilayer  systems  and  seemed 
to  work  reasonably  well  for  the  experiments  Rosenblatt  performed,  we  wished  to 
include  it  in  our  study.  Rosenblatt’s  back-propagation  method  is  a  nondeterministic 
way  to  assign  errors  to  hidden  units  based  on  the  errors  of  output  units.  The  following 
is  our  specification  of  Rosenblatt’s  back-propagation  method: 

1.  Initialize  all  weights  to  zero. 

2.  Receive  input  vector,  calculate  the  output  of  all  units  using  a  linear  threshold 
function,  and  receive  error  signals  for  the  output  units. 

3.  Apply  the  perceptron  learning  rule  to  the  output  units. 

4. Calculate the error, b_jk, passed back from output unit k to hidden unit j
(probabilistically based on the output unit’s error, the weight connecting unit
j to unit k, and the output of unit j):

v_k[t] ← random variable from a uniform probability density
function over [0, 1], where k ∈ O takes the values of the
indices of the output units.

\[
b_{jk}[t] =
\begin{cases}
-1, & \text{if } y_j[t] = 1 \text{ and } (d_k[t] - y_k[t])\, w_{jk}[t] < 0 \text{ and } v_k[t] < p_1; \\
+1, & \text{if } y_j[t] = 0 \text{ and } (d_k[t] - y_k[t])\, w_{jk}[t] > 0 \text{ and } v_k[t] < p_2, \\
    & \text{or if } y_j[t] = 0 \text{ and } (d_k[t] - y_k[t])\, w_{jk}[t] < 0 \text{ and } v_k[t] < p_3; \\
0, & \text{otherwise.}
\end{cases}
\]

5. Apply the perceptron learning rule to each hidden unit j, using the sign of the
sum of the back-propagated errors from the output units as the error signals:

\[
\Delta w_{ij}[t] = \rho \, \operatorname{sgn}\Bigl( \sum_{k \in O} b_{jk}[t] \Bigr)\, x_i[t].
\]


6.  Repeat,  starting  at  Step  2,  until  the  prespecified  number  of  time  steps  has 
elapsed. 
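Step 4 and the sign computation of Step 5 can be sketched as below. This is a tentative illustration: the case structure (in particular the value −1 for an active unit, the value +1 for an inactive unit, and the directions of the inequalities) is an assumed reading of the specification, and the function names and the `rng` argument are hypothetical.

```python
import random

def backpropagated_error(y_j, w_jk, err_k, p1=0.9, p2=0.3, p3=0.1, rng=random):
    """Error passed from output unit k to hidden unit j (assumed case structure).

    y_j is the binary output of hidden unit j, w_jk the weight from j to k,
    and err_k the output unit's error (desired minus actual output).
    """
    v = rng.random()                  # uniform on [0, 1)
    agreement = err_k * w_jk          # > 0: activity in unit j pushes k the right way
    if y_j == 1 and agreement < 0 and v < p1:
        return -1
    if y_j == 0 and agreement > 0 and v < p2:
        return 1
    if y_j == 0 and agreement < 0 and v < p3:
        return 1
    return 0

def hidden_error_sign(errors):
    """Sign of the summed back-propagated errors, used as the error signal."""
    total = sum(errors)
    return (total > 0) - (total < 0)
```

A hidden unit's weight change is then ρ times `hidden_error_sign(...)` times the corresponding input component, as in the perceptron rule.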

Rosenblatt’s back-propagation method depends on the parameter ρ, a factor
determining the magnitude of change for each weight, and the parameters p_1, p_2, and
p_3, which are probabilities affecting the frequencies with which the back-propagated
error signals take the values +1 and −1. Rosenblatt performed a number of
experiments and determined that the values p_1 = 0.9, p_2 = 0.3, and p_3 = 0.1 were
reasonable values. Rather than attempting to optimize all four parameters, we used
these values for p_1, p_2, and p_3 for all experiments, only varying the value of ρ.

The results are in Table 4. There are few significant differences for different
values of ρ, although values from 0.125 to 0.5 resulted in slightly lower values of
ν. The values of ν and μ show no improvement over a single-layer system. Indeed,
the learning curve for Rosenblatt’s method in Figs. 14 and 4 shows no improvement
over time and is always worse than the single-layer level. The learning curve is
averaged over 30 runs of 300,000 steps each, giving values of ν = 23.9 ± 1.58 and
μ = 121,115 ± 92. To judge the performance of Rosenblatt’s back-propagation method
fairly, additional values of p_1, p_2, and p_3 must be tested.

Rumelhart,  Hinton,  and  Williams 

Another approach to the back-propagation of errors was taken by Rumelhart,
Hinton, and Williams [44]. Our specialization of this method to the two-layer
multiplexer network is as follows:

1. Randomly initialize all weights to be in the interval [−0.1, 0.1].

2. Receive input vector, calculate output of all units, and receive error signals for
the output units. All units use the semilinear output function:

\[
y[t] = \frac{1}{1 + e^{-\sum_{i} w_i[t]\, x_i[t]}}
\]

3. Calculate δ_k for each output unit k ∈ O:

\[
\delta_k[t] = (d'_k[t] - y_k[t])\, y_k[t]\, (1 - y_k[t]),
\]

where d'_k is a modified version of the desired output, defined as

\[
d'_k[t] =
\begin{cases}
0.9, & \text{if } d_k[t] = 1; \\
0.1, & \text{if } d_k[t] = 0.
\end{cases}
\]

4. Apply the learning rule to the weights of each output unit k:

\[
\Delta w_{jk}[t] = \rho\, \delta_k[t]\, x_j[t] + \rho_m\, \Delta w_{jk}[t-1],
\]

where x_j[t] is an input component received by output unit k. Recall that the
output units receive the original input terms to the system plus the output of
the hidden units.

5. Calculate δ_j for each hidden unit j:

\[
\delta_j[t] = \Bigl( \sum_{k \in O} \delta_k[t]\, w_{jk}[t] \Bigr)\, y_j[t]\, (1 - y_j[t]).
\]

6. Apply the learning rule to the weights of each hidden unit j:

\[
\Delta w_{ij}[t] = \rho\, \delta_j[t]\, x_i[t] + \rho_m\, \Delta w_{ij}[t-1].
\]

7. Repeat, starting with Step 2, until the prespecified number of time steps has
elapsed.
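The steps above can be sketched for a network with a single output unit as follows. This is an illustration under assumptions, not the authors' program: the function and argument names are hypothetical, and the hidden-unit deltas are computed from the output weights as they stood before this step's update, which is one possible reading of the ordering of Steps 4 and 5.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, d, hid_w, out_w, prev_hid, prev_out, rho=0.25, rho_m=0.9):
    """One weight update for a network with a single semilinear output unit.

    hid_w[j][i] connects input i to hidden unit j; out_w lists the output
    unit's weights on the inputs followed by its weights on the hidden units.
    Returns the weight changes, which serve as prev_* on the next step.
    """
    y_hid = [logistic(sum(w * xi for w, xi in zip(ws, x))) for ws in hid_w]
    out_in = list(x) + y_hid          # output unit sees inputs and hidden outputs
    y_out = logistic(sum(w * v for w, v in zip(out_w, out_in)))

    d_mod = 0.9 if d == 1 else 0.1                        # modified target
    delta_out = (d_mod - y_out) * y_out * (1.0 - y_out)   # Step 3

    # Step 5, using the output weights from before this step's update.
    delta_hid = [delta_out * out_w[len(x) + j] * yj * (1.0 - yj)
                 for j, yj in enumerate(y_hid)]

    # Steps 4 and 6: gradient term plus a fraction of the previous change.
    dw_out = [rho * delta_out * v + rho_m * p for v, p in zip(out_in, prev_out)]
    dw_hid = [[rho * dj * xi + rho_m * p for xi, p in zip(x, prev_row)]
              for dj, prev_row in zip(delta_hid, prev_hid)]

    for i, dw in enumerate(dw_out):
        out_w[i] += dw
    for ws, row in zip(hid_w, dw_hid):
        for i, dw in enumerate(row):
            ws[i] += dw
    return dw_hid, dw_out
```

With `rho_m = 0` each change is exactly `rho` times the negative gradient of the squared error between the modified target and the output; the momentum term then blends in the previous change.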


By adding a fraction of the previous Δw to the current weight change (Steps 4
and 6), it is hoped that the weight values will be more likely to follow the slope of
the error function at the bottom of steep valleys, by canceling opposing steps up
one side or the other. Rumelhart et al. consider this additional term as affecting the
“momentum” of the trajectory of weight values. The method has two parameters:
the rate of change parameter ρ and the factor ρ_m that controls the magnitude of the
momentum term. Table 5 shows the values of ρ and ρ_m that were tested and the
results averaged over 10 runs of 100,000 steps each.

Note the modification of the desired output value in Step 3. Rather than values
of 1 and 0, values of 0.9 and 0.1 are used. Without this modification, weight values
can grow in magnitude to the point where truncation errors due to the particular
computer implementation can cause weight values to become frozen: the value of
y(1 − y) in the weight update equation becomes equal to zero.
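The freezing effect is easy to reproduce. The snippet below is a small illustration assuming present-day IEEE double-precision arithmetic, which is not necessarily the arithmetic of the machine used for the original experiments; the variable names are hypothetical.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

# With a large net input the output rounds to exactly 1.0, so the factor
# y(1 - y) that multiplies every weight change becomes exactly zero.
saturated = logistic(40.0)
frozen_factor = saturated * (1.0 - saturated)

# A target of 0.9 is already reached at a net input of ln 9, where the
# factor y(1 - y) is still well away from zero.
reachable = logistic(math.log(9.0))
live_factor = reachable * (1.0 - reachable)
```

With targets of exactly 1 and 0 the error can only be driven down by pushing the net input toward ever larger magnitudes, which is what leads to the saturated case; targets of 0.9 and 0.1 can be met with bounded weights.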


Table  5:  Rumelhart  et  al.  Error  Back-propagation  Method  on  the  Multiplexer 
Task 


     ρ            μ                 ν
    0.05    15188 ± [?]       19.8 ± 0.84
    0.10    11716 ± 1002      11.7 ± 1.10
    0.25    14144 ± 1426       0.1 ± 0.55
    0.50     6066 ± 1052       0.1 ± 0.19
    1.00     4044 ± 1224       0.7 ± 0.19
    2.00     1289 ± 915        0.2 ± 0.52
    4.00     1294 ± 816        0.2 ± 0.52
    8.00    11446 ± 4097       6.6 ± 2.91
   16.00    52422 ± 5497      18.1 ± 1.12
                 ρ_m = 0

     ρ            μ                 ν                 ρ            μ                 ν
    0.05    14976 [?] ± 496   18.9 ± 1.97            0.05    61[?]0 ± 149 [?]   0.1 ± 0.26
    0.10    11218 ± 1671      15.9 ± 4.57            0.10    1207 [?] ± 454     0.0 ± 0.00
    0.25    26245 ± 7154       9.1 ± 7.79            0.25    1747 ± 480         0.0 ± 0.00
    0.50    11287 ± 2562       0.1 ± 0.26            0.50    1492 ± 844         0.2 ± 0.52
    1.00     1816 ± 869        0.2 ± 0.52            1.00    5802 ± 2686        1.9 ± 1.86
    2.00     1267 ± 1229       1.0 ± 0.86
    4.00     8905 ± 2211       1.5 ± 1.55
                 ρ_m = 0.5                                        ρ_m = 0.9

The output value of a semilinear unit is a real value between 0 and 1. To compare
with the other methods that use binary-valued, linear threshold units as the output
unit, the output of the output unit k was set to 1 if y_k > 0.5 and was otherwise set to
0 while calculating μ and ν and the learning curve. This is only done in measuring
performance, not in actually running the learning method.

From Table 5 one can see that this error back-propagation method reliably solved
the multiplexer task within 100,000 steps, for ρ = 0.1 and 0.25 and ρ_m = 0.9. For
ρ = 0.25 only 1,747 errors were accumulated over 100,000 steps (μ = 1,747). Best
performance (considering both ν and μ) resulted when ρ = 0.25 and ρ_m = 0.9.
These parameter values were used to generate the learning curve shown in Fig. 4
and in Fig. 14, averaged over 30 runs of 300,000 steps each. The curve shows that
extremely good performance is achieved very early in the runs; as early as 6,000
steps the average number of errors per step is below 0.06. The performance measures
associated with this learning curve are ν = 0.00 ± 0.00 and μ = 1,962 ± 148.

The  third  curve  in  Fig.  4  shows  the  results  of  an  experiment  designed  to  test  a 
modification  to  Rumelhart  et  al.’s  method  proposed  by  Sutton  [48].  He  suggested 
that  it  is  the  sign  of  the  weight  value  appearing  in  the  expression  in  Step  5  above 
that  is  the  important  contribution  of  the  weight,  and  that  the  magnitude  of  it  might 
hamper  the  method’s  progress,  particularly  when  the  magnitude  is  very  small.  We 
tested  this  hypothesis  by  replacing  w with  the  sign  of  tu,-*,  resulting  in  a  new 
expression  for  £,[<]: 

biV ]=  f  H  sgn(w, *[<[)]  y,\t\  (1  -  y, ■[<]). 

Wo  I 

As before, we varied $\rho$, with the results shown in Table 6, which are averaged over
10 runs of 100,000 steps each. The best value of $\rho$ is still 0.25 and for $\rho_m$ it is 0.9.
The results averaged over 30 runs of 300,000 steps, using these parameter values, are
$\nu$ = 0.00 ± 0.00 and $\mu$ = 1,354 ± 575, and the learning curve is shown in Fig. 4.
The modification appears to retard the method’s initial progress, but the task is still
reliably solved. The cumulative error measure, $\mu$, is not significantly different from
that of the unmodified Rumelhart method. The modified method does appear to be
more robust than the unmodified method; the task is reliably solved ($\nu$ = 0.00) for
a wider range of parameter values.
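
The sign-based modification can be illustrated with a minimal sketch. It assumes the usual back-propagated error term for a logistic hidden unit, in which each output unit's error term is weighted by the connecting weight; the function and argument names (`hidden_delta`, `use_sign`) are hypothetical and chosen only for this illustration.

```python
def sign(w):
    """Return -1, 0, or +1 according to the sign of w."""
    return (w > 0) - (w < 0)


def hidden_delta(y_j, output_weights, output_deltas, use_sign=False):
    """Error term for one hidden unit with logistic output y_j.

    output_weights[k] is the weight from this hidden unit to output unit k and
    output_deltas[k] is the error term already computed for output unit k.
    With use_sign=True each weight is replaced by its sign, as in the
    modification attributed to Sutton in the text.
    """
    total = 0.0
    for w_jk, delta_k in zip(output_weights, output_deltas):
        total += delta_k * (sign(w_jk) if use_sign else w_jk)
    return total * y_j * (1.0 - y_j)


# A very small output weight nearly silences the unmodified error term,
# while the sign-based version is unaffected by the weight's magnitude.
standard = hidden_delta(0.5, [0.001], [0.4])
modified = hidden_delta(0.5, [0.001], [0.4], use_sign=True)
```

Under these assumptions the unmodified term scales with the weight magnitude, which is the behavior the modification is meant to avoid when the weight is very small.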


Table 6: Sutton’s Modification of the Error Back-propagation Method on the
Multiplexer Task

$\rho_m$ = 0

$\rho$     $\mu$             $\nu$
0.10       27759 ± 3438      7.2 ± 3.74
0.25       11594 ± 958       0.1 ± 0.26
0.50       5846 ± 1370       0.0 ± 0.00
1.00       3013 ± 573        0.1 ± 0.26
2.00       2336 ± 355        0.1 ± 0.26
4.00       4378 ± 1179       1.0 ± 0.86

$\rho_m$ = 0.5

$\rho$     $\mu$             $\nu$
0.10       16447 ± 2504      1.6 ± 1.45
0.25       5427 ± 447        0.0 ± 0.00
0.50       2742 ± 369        0.0 ± 0.00
1.00       1536 ± 192        0.0 ± 0.00
2.00       2173 ± 767        0.1 ± 0.26
4.00       8524 ± 1175       3.6 ± 0.96

$\rho_m$ = 0.9

$\rho$     $\mu$             $\nu$
0.05       5091 ± 538        0.0 ± 0.00
0.10       2411 ± 310        0.0 ± 0.00
0.25       1310 ± 404        0.0 ± 0.00
0.50       1353 ± 796        0.2 ± 0.00
1.00       2968 ± 1309       0.7 ± 0.77


Associative  Reinforcement  Learning 

Four associative reinforcement-learning methods were studied, two being variants
of one of the others. Barto and colleagues have developed several associative
reinforcement-learning methods [7,6,14,47]. Sutton [47] compared a number of these
methods. For tasks most similar to those faced by hidden units in the networks applied
to the multiplexer task, Sutton found that a particular learning rule, which we
will call the “associative search with reinforcement prediction,” or AS-RP, rule, performed
better than others.

Associative Search with Reinforcement Prediction

The  AS-RP  method  employs  an  additional  unit  that  adjusts  its  weights,  v,  in  order 
to  match  as  closely  as  possible  the  value  of  the  reinforcement  expected  for  acting  in 
the  presence  of  each  input  vector.  This  provides  the  hidden  units  with  a  “reference” 
signal  to  which  the  current  reinforcement  can  be  compared  to  determine  whether 
it  is  greater  or  less  than  the  reinforcement  usually  received  when  given  the  current 
input  vector.  One  can  think  of  this  extra  unit  as  a  predictor  of  the  reinforcement  to 
be  received.  The  AS-RP  method  is  defined  as  follows: 

1.  Initialize  all  weights  to  zero. 

2. Receive input vector, calculate output of all units, and receive error signals for
the output units. The output, $y_j$, of hidden unit $j$ is given by:

$$y_j[t] = \begin{cases} 1, & \text{if } \sum_{i=0}^{6} w_{ij}[t]\,x_i[t] + \eta_j[t] > 0; \\ 0, & \text{otherwise,} \end{cases}$$

where the $\eta_j[t]$ are sequences of random variables with density function

3. Apply the perceptron learning rule to the output units.

4. Calculate the reinforcement signal for the hidden units:

$$r[t] = 1 - \frac{1}{m} \sum_{j \in O} \bigl| d_j[t] - y_j[t] \bigr|,$$

where $m$ is the number of output units. Since $m = 1$, $r[t] \in \{0, 1\}$.

5. Calculate the prediction of reinforcement, $r^p[t]$, as follows:

$$r^p[t] = \sum_{i=1}^{n} v_i[t]\,x_i[t],$$

where $n$ is the number of input components to the system and $v_i[t]$ is the
predictor-unit’s weight associated with input component $x_i[t]$.

6. Apply the associative search rule to hidden unit $j$:

$$\Delta w_{ij}[t] = \rho\,\bigl(r[t] - r^p[t]\bigr)\,\bigl(y_j[t] - \pi_j[t]\bigr)\,x_i[t],$$

where

$$\pi_j[t] = E\{\, y_j[t] \mid w_j[t];\, x[t] \,\},$$

which is the expected value of the output $y_j$ of unit $j$, given its current weight
values and input. Since $y_j \in \{0, 1\}$, $\pi_j$ is the probability that $y_j = 1$.

7. Update the predictor’s weights:

$$\Delta v_i[t] = \rho_p\,\bigl(r[t] - r^p[t]\bigr)\,x_i[t].$$

8.  Repeat,  starting  with  Step  2,  until  the  prespecified  number  of  time  steps  have 
elapsed. 
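
The hidden-unit and predictor updates of the AS-RP method can be sketched as follows. This is only an illustration under stated assumptions: the noise density is not legible in this copy, so the sketch assumes logistic noise (which makes the firing probability a logistic function of the weighted sum), it feeds the predictor the same input vector as the hidden units, and all function names are hypothetical.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def hidden_outputs(W, x, rng):
    """Sample binary outputs y_j and return them with the probabilities pi_j.

    ASSUMPTION: the noise eta_j is logistic, so P(y_j = 1) = logistic(s_j).
    """
    ys, pis = [], []
    for w_j in W:
        s_j = sum(w * x_i for w, x_i in zip(w_j, x))
        pi_j = logistic(s_j)
        ys.append(1 if rng.random() < pi_j else 0)
        pis.append(pi_j)
    return ys, pis


def as_rp_step(W, v, x, ys, pis, r, rho, rho_p):
    """Update hidden weights W and predictor weights v in place.

    Returns the reinforcement prediction that was used for the update.
    """
    r_pred = sum(v_i * x_i for v_i, x_i in zip(v, x))
    # Associative search rule for each hidden unit.
    for w_j, y_j, pi_j in zip(W, ys, pis):
        for i, x_i in enumerate(x):
            w_j[i] += rho * (r - r_pred) * (y_j - pi_j) * x_i
    # The predictor moves its prediction toward the received reinforcement.
    for i, x_i in enumerate(x):
        v[i] += rho_p * (r - r_pred) * x_i
    return r_pred


rng = random.Random(0)
x = [1, 0, 1, 1, 0, 0, 1]           # x[0] = 1 is the constant component
W = [[0.0] * 7 for _ in range(4)]   # four hidden units, all weights zero
v = [0.0] * 7
ys, pis = hidden_outputs(W, x, rng)
r_pred = as_rp_step(W, v, x, ys, pis, r=1.0, rho=0.16, rho_p=0.01)
```

With zero initial weights every hidden unit fires with probability 0.5 and the prediction is 0, so a reward of 1 reinforces whichever output each unit happened to produce.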

Two parameters control this method: the rate of change in modifying the hidden
units’ weights is $\rho$, and the rate of change in modifying the reinforcement predictor’s
weights is $\rho_p$. Five values of $\rho$ were tried while $\rho_p$ was set to one of three values. For
each set of parameter values, 10 runs were made of 300,000 steps each.

The results in Table 7 show that the AS-RP method did not completely solve
the multiplexer task, but for $\rho$ = 0.16 and $\rho_p$ = 0.01 the value of $\nu$ was about
2.8, meaning that after 300,000 steps an average of only 2.8 out of 64 input vectors
resulted in an incorrect output. The performance of the AS-RP method over time is
shown by its learning curve in Figure 13. The learning curve is averaged over 30 runs
using $\rho$ = 0.16 and $\rho_p$ = 0.01, and resulted in $\nu$ = 3.36 ± 1.98 and $\mu$ = 48,754 ± 8,662.
Better performance might be attainable by testing additional parameter values.

Another  way  to  improve  this  method’s  performance  is  to  include  the  output  of 
the  hidden  units  in  the  set  of  input  components  to  the  reinforcement-predictor  unit. 
It  may  be  impossible  for  a  single  unit  to  implement  an  accurate  mapping  from  input 
vectors  to  reinforcement  values,  just  as  it  is  impossible  for  a  single  unit  to  implement 
a  multiplexer  function.  This  possibility  was  not  tested. 

Associative Reward-Penalty

The second method from the reinforcement-learning class that we studied is the
Associative Reward-Penalty, or $A_{R-P}$, learning rule described in Section 2.

1.  Initialize  all  weights  to  zero. 

2.  Receive  input  vector,  calculate  output  of  all  units,  and  receive  error  signals  for 
the  output  units.  Output  functions  are  those  used  for  the  AS-RP  method. 

3.  Apply  the  perceptron  learning  rule  to  the  output  units. 

4. Calculate the global reinforcement signal for the hidden units:

$$r[t] = 1 - \frac{1}{|O|} \sum_{j \in O} \bigl| d_j[t] - y_j[t] \bigr|.$$

For the multiplexer task, $|O| = 1$, so $r[t] \in \{0, 1\}$, but in general $r[t] \in [0, 1]$.

5. Apply the $A_{R-P}$ rule to each hidden unit.

6.  Repeat,  starting  with  Step  2,  until  the  prespecified  number  of  steps  have 
elapsed. 

The $A_{R-P}$ method depends on two parameters. The rate of weight change is
controlled by $\rho$ and $\lambda$. If $\lambda$ = 0, no change is made to the weight values when the
“penalty” signal $r[t] = 0$ is received. See Section 2.
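
As an illustration of how the global reinforcement signal and the two parameters interact, the following minimal sketch uses the S-model form of the reward-penalty update that this section gives for the local-reinforcement variant, applied here with the global signal. The function names are hypothetical, and the firing probability `pi_j` is taken as a given input rather than computed.

```python
def global_reinforcement(desired, actual):
    """r = 1 - (1/|O|) * sum over output units of |d - y|."""
    total_error = sum(abs(d - y) for d, y in zip(desired, actual))
    return 1.0 - total_error / len(desired)


def ar_p_update(w_j, x, y_j, pi_j, r, rho, lam):
    """Update the weights of one hidden unit in place.

    The reward term pulls pi_j toward the output y_j that was produced;
    the penalty term, scaled by lam, pulls it toward the opposite output.
    """
    for i, x_i in enumerate(x):
        reward_term = rho * r * (y_j - pi_j) * x_i
        penalty_term = lam * rho * (1.0 - r) * (1.0 - y_j - pi_j) * x_i
        w_j[i] += reward_term + penalty_term


x = [1, 0, 1]

# Penalty (r = 0) with lam = 0.004: a small step away from the produced output.
r_penalty = global_reinforcement([1], [0])
w_penalized = [0.0, 0.0, 0.0]
ar_p_update(w_penalized, x, y_j=1, pi_j=0.5, r=r_penalty, rho=1.0, lam=0.004)

# Penalty with lam = 0: the weights do not change at all.
w_inaction = [0.0, 0.0, 0.0]
ar_p_update(w_inaction, x, y_j=1, pi_j=0.5, r=r_penalty, rho=1.0, lam=0.0)

# Reward (r = 1): a full-sized step toward the produced output.
r_reward = global_reinforcement([1], [1])
w_rewarded = [0.0, 0.0, 0.0]
ar_p_update(w_rewarded, x, y_j=1, pi_j=0.5, r=r_reward, rho=1.0, lam=0.004)
```

The asymmetry between the reward and penalty step sizes is governed entirely by $\lambda$ in this sketch.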

Table 8 contains the results of the $A_{R-P}$ method on the multiplexer task, averaged
over 10 runs of 300,000 steps each. Of the parameter values tested, $\rho$ = 1 and
$\lambda$ = 0.004 resulted in the best performance, solving the task with a final number of
errors over all input vectors of 0.02.

The learning curve in Fig. 13 shows that the $A_{R-P}$ method performed much better
than the AS-RP. Averaged over 30 runs of 300,000 steps each, and using $\rho$ = 1 and
$\lambda$ = 0.004, the $A_{R-P}$ rule resulted in $\nu$ = 0.01 ± 0.01 and $\mu$ = 15,725 ± 3,129. The
value of 0.01 ± 0.01 for $\nu$ indicates that the solution to the multiplexer task was
reliably found.

Table 8: $A_{R-P}$ Method on the Multiplexer Task

$\rho$ = 1

$\lambda$    $\nu$            $\mu$
0.001        0.52 ± 1.30      24,960 ± 10,347
0.002        0.33 ± 0.80      25,234 ± 7,486
0.004        0.02 ± 0.01      14,493 ± 5,557
0.008        0.01 ± 0.01      20,046 ± 5,918
0.016        18.80 ± 6.65     107,729 ± 7,101
0.032        23.00 ± 3.26     118,806 ± 185

$\lambda$ = 0.004

$\rho$       $\nu$            $\mu$
0.1          1.30 ± 2.02      51,109 ± 10,157
0.2          0.08 ± 0.03      25,377 ± 5,479
0.4          1.80 ± 4.63      20,301 ± 7,050
0.8          0.71 ± 1.53      17,320 ± 7,553
1.0          0.02 ± 0.01      14,493 ± 5,557
1.6          0.01 ± 0.00      15,167 ± 3,908
3.2          11.60 ± 8.63     81,798 ± 16,013


Local  Reinforcement 

The AS-RP and the $A_{R-P}$ methods function in a “global” reinforcement paradigm
in which each hidden unit receives the same reinforcement signal. However, hidden
units in a multilayer system can be provided with more informative evaluation information
than that provided by the global reinforcement signal. We investigated one
possible way of using this information to construct a unique “local” reinforcement
signal for each hidden unit. The approach is similar to Rosenblatt’s back-propagation
method in its division into several cases according to units’ outputs and weight values,
but differs in that reinforcements are propagated rather than errors.

Steps 1 through 3 are identical to those of the $A_{R-P}$ method.

4. Let $r_{jk}[t]$ be a reinforcement based on the output value of output unit $k$ and
the weight connecting hidden unit $j$ to output unit $k$, defined as:

$$r_{jk}[t] = \begin{cases}
0.5, & \text{if } w_{jk}[t] = 0; \\
1, & \text{if } w_{jk}[t] \neq 0 \text{ and } d_k[t] = y_k[t], \\
   & \text{or } y_j[t] = 1 \text{ and } w_{jk}[t]\,(d_k[t] - y_k[t]) > 0, \\
   & \text{or } y_j[t] = 0 \text{ and } w_{jk}[t]\,(d_k[t] - y_k[t]) < 0; \\
0, & \text{otherwise.}
\end{cases}$$

Calculate the local reinforcement signal for hidden unit $j$ as:

$$r_j[t] = \prod_{k \in O} r_{jk}[t],$$

though, for this task $O = \{5\}$, so $r_j = r_{j5}$.

5. Apply the $A_{R-P}$ rule to the hidden units, now using separate reinforcements
for each (see footnote 2):

$$\Delta w_{ij}[t] = \rho\, r_j[t]\,\bigl(y_j[t] - \pi_j[t]\bigr)\,x_i[t] + \lambda\rho\,\bigl(1 - r_j[t]\bigr)\,\bigl(1 - y_j[t] - \pi_j[t]\bigr)\,x_i[t].$$

6. After the prespecified number of steps have elapsed the final-step performance
measure $\nu$ is calculated.

The motivation for the cases in Step 4 is as follows. When a hidden unit has no
influence on the output unit, i.e., $w_{jk} = 0$, then no preference in its output should be
revealed. To accomplish this, $r_{jk}$ is set to 0.5 regardless of the output of the hidden
unit, the output unit, and the correct output. The second case is composed of three
situations. First, if the hidden unit does have a nonzero output weight, i.e., $w_{jk} \neq 0$,
and the output unit generated a correct response, then the hidden unit is “rewarded”
by being assigned a reinforcement value of 1, increasing the probability of the output
value that it just produced. The second part rewards the hidden unit if its output
value is 1 and its output weight has the same sign as the output unit’s error. The
third part rewards the unit when its output value is 0 and its output weight differs
in sign from the output unit’s error.
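
The case analysis for the local reinforcement can be written out as a short sketch. The function names are hypothetical, and combining the per-output values by a product over the output units is our reading of the definition in Step 4; for the single-output multiplexer task the product reduces to the one per-output value.

```python
def local_reinforcement_jk(w_jk, y_j, d_k, y_k):
    """Reinforcement for hidden unit j with respect to output unit k."""
    if w_jk == 0:
        return 0.5            # no influence: express no preference
    error = d_k - y_k
    if error == 0:
        return 1.0            # output unit was correct: reward
    if y_j == 1 and w_jk * error > 0:
        return 1.0            # unit fired and its weight has the sign of the error
    if y_j == 0 and w_jk * error < 0:
        return 1.0            # unit was silent and its weight opposes the error
    return 0.0                # otherwise: penalize


def local_reinforcement(weights_j, y_j, desired, actual):
    """Combine the per-output reinforcements for hidden unit j (product form)."""
    r_j = 1.0
    for w_jk, d_k, y_k in zip(weights_j, desired, actual):
        r_j *= local_reinforcement_jk(w_jk, y_j, d_k, y_k)
    return r_j


# The output unit should have produced 1 but produced 0 (error = +1).
fired_positive_weight = local_reinforcement([2], 1, [1], [0])
silent_positive_weight = local_reinforcement([2], 0, [1], [0])
silent_negative_weight = local_reinforcement([-3], 0, [1], [0])
```

In the example, a hidden unit with a positive output weight is rewarded for firing and penalized for staying silent, since firing would have pushed the output unit toward the correct response.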

This modification to the $A_{R-P}$ rule does not add any new parameters. We tried
a number of values for $\rho$ and $\lambda$ and averaged the results over 10 runs of 300,000
steps each. From Table 9 we see that $\rho$ = 0.5 and $\lambda$ = 0.0001 resulted in the best
value of $\nu$, which was 0.55 errors over the 64 input vectors after 300,000 steps. The
cumulative measure, $\mu$, was lowest for $\lambda$ = 0.0005.

2 This can be called the “S-model” $A_{R-P}$ rule following the terminology used in the study of
stochastic learning automata [3f>|. It is applicable if the reinforcement value, $r$, is a real number in
the interval [0, 1]. Note that it specializes to the version of the $A_{R-P}$ method given by Equation 2.4
if the reward and penalty values of $r$ are respectively represented by 1 and 0.


Table 9: $A_{R-P}$ with Local Reinforcement on the Multiplexer Task

$\rho$ = 1

$\lambda$    $\nu$            $\mu$
0.0          1.32 ± 2.24      16,069 ± 10,750
0.00001      3.07 ± 2.84      28,822 ± 15,678
0.0001       0.24 ± 0.52      12,512 ± 4,937
0.0002       1.07 ± 1.72      17,830 ± 7,706
0.0005       0.39 ± 0.55      10,418 ± 1,973
0.001        0.76 ± 0.87      14,539 ± 1,997
0.002        4.64 ± 5.51      21,145 ± 2,279
0.004        6.53 ± 4.69      34,506 ± 1,613
0.008        10.30 ± 3.18     64,436 ± 1,644

$\lambda$ = 0.0001

$\rho$       $\nu$            $\mu$
0.01         23.72 ± 3.05     122,127 ± 116
0.25         2.84 ± 3.61      40,581 ± 16,852
0.50         0.55 ± 1.27      17,109 ± 7,221
0.75         1.31 ± 1.69      23,290 ± 10,555
1.00         0.24 ± 0.52      12,512 ± 4,937
1.25         1.44 ± 2.66      22,467 ± 12,298


A learning curve for the $A_{R-P}$ with local reinforcement, again averaged over 30
runs of 300,000 steps each, is included in Fig. 13. The values $\rho$ = 0.6 and $\lambda$ =
0.0001 were used for the method’s parameters. This modification to the $A_{R-P}$ method
performs slightly better than the original $A_{R-P}$ method before approximately the
20,000th step, and thereafter its performance is worse than that of the $A_{R-P}$.

The  local  reinforcement  addition  seems  to  help  during  the  early  stages,  but  is  a 
hindrance  throughout  the  remainder  of  a  run.  Perhaps  this  indicates  that  using  the 
information  about  the  hidden  units’  output  weights  and  the  output  units’  errors  is 
only  beneficial  while  the  hidden  units  have  minor  effects  on  the  output  unit  through 
output weights of small magnitudes. When output weights are near zero, learning
according to the $A_{R-P}$ method with global reinforcement is very slow because there
is very little correlation between a hidden unit’s output and the global reinforcement
signal. But as the output weights increase in magnitude they acquire more of an
influence on the global reinforcement and can begin to optimize their weight values.
A more complex task, one requiring more than a single output unit, might demonstrate
a greater potential for using this scheme for calculating local reinforcement.


Penalty  Prediction 


The  plight  of  a  hidden  unit  that  has  not  yet  acquired,  or  has  lost,  a  substantial 
influence  on  the  output  units  is  to  learn  very  slowly  if  it  is  modifying  its  weights 
through  efforts  to  increase  the  reinforcement  value.  Is  there  some  way  to  put  such 
an “unused” unit to better use? We applied a second extension of the $A_{R-P}$ method
to the multiplexer task to investigate this question. This extension is based on the
assumption that poor performance is caused by the lack of an appropriate representation.
Situations for which incorrect outputs are generated need to be represented
differently, perhaps with additional components, giving the output units more degrees
of freedom with which they can alter their outputs.

To  realize  this  idea  we  divide  the  learning  rule  for  the  hidden  units  into  two  parts, 
each  part  coming  into  play  at  different  stages.  When  a  hidden  unit  has  a  substantial 
effect on units “downstream,” then the normal $A_{R-P}$ learning rule is followed. But
when  a  hidden  unit  does  not  significantly  influence  other  units,  the  hidden  unit 
adjusts  its  weights  in  an  attempt  to  match  them  to  input  vectors  that  result  in 
low  reinforcement  values,  in  effect  becoming  a  “penalty  predictor.”  In  this  way, 
new  features  are  introduced  that  represent  inputs  for  which  the  performance  of  the 
system  is  low.  This  is  related  to  the  data-directed  method  of  Reilly,  Cooper,  and 
Elbaum  [40],  who  dedicate  new  hidden  units  whenever  an  error  is  encountered  by 
their  system. 

The  implementation  of  this  strategy  depends  on  a  measure  of  the  degree  to  which 
a  hidden  unit  has  an  influence  on  other  units.  Since  in  this  task  a  hidden  unit  can  only 
influence  one  other  unit,  we  simply  used  the  magnitude  of  the  hidden  unit’s  single 
output  weight  as  an  indication  of  influence,  though,  as  Klopf  and  Gose  [26]  showed, 
other  measures  might  lead  to  more  accurate  indicators  of  influence.  The  magnitude 
of  a  hidden  unit’s  output  weight  is  squashed  into  the  range  [0,  1]  by  passing  it  through 
a  logistic  function.  Some  method  of  combining  the  measures  from  different  output 
weights  must  be  employed  when  the  network  has  more  than  one  output  unit.  The 
$A_{R-P}$ method is modified as follows:

Steps 1 through 4 are identical to those of the $A_{R-P}$ method.

5. Apply the $A_{R-P}$ rule with penalty prediction to the hidden units:

(a) Calculate the influence, $a_j$, of hidden unit $j$ on the output units:

$$a_j[t] = \frac{1}{1 + e^{-(|w_{jk}[t]| - w_a)/\tau_a}}.$$

(b) Update the weights:

$$\Delta w_{ij}[t] = a_j[t]\,\Bigl(\rho\, r[t]\,\bigl(y_j[t] - \pi_j[t]\bigr)\,x_i[t] + \lambda\rho\,\bigl(1 - r[t]\bigr)\,\bigl(1 - y_j[t] - \pi_j[t]\bigr)\,x_i[t]\Bigr) + \bigl(1 - a_j[t]\bigr)\,\rho_a\,\bigl(1 - r[t] - \pi_j[t]\bigr)\,x_i[t].$$

6. Same as Step 6 of the $A_{R-P}$ method.

The equation in Step 5b is composed of two main parts. The first part is the
expression for the S version of the $A_{R-P}$ rule. Its contribution to the update of
weights varies inversely with that of the second part of the equation. The second part
is only significant when $a_j$ is small, meaning that hidden unit $j$ has little influence
on the output unit. It serves to push the weight vector in the direction of the current
input vector when $r$ is small and it pushes the weight vector away from the input
vector when $r$ is large.
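
The two-part structure of the update can be sketched in code. This is a minimal illustration that blends the S-model reward-penalty update with a penalty-predictor term according to the influence value; the function name is hypothetical, and the influence `a_j` and firing probability `pi_j` are supplied as inputs rather than computed.

```python
def penalty_prediction_update(w_j, x, y_j, pi_j, r, a_j, rho, lam, rho_a):
    """Update one hidden unit's weights in place.

    a_j near 1: the unit influences the output, so the reward-penalty part dominates.
    a_j near 0: the unit is "unused", so it is trained as a penalty predictor.
    """
    for i, x_i in enumerate(x):
        reward_penalty = (rho * r * (y_j - pi_j) * x_i
                          + lam * rho * (1.0 - r) * (1.0 - y_j - pi_j) * x_i)
        penalty_predictor = rho_a * (1.0 - r - pi_j) * x_i
        w_j[i] += a_j * reward_penalty + (1.0 - a_j) * penalty_predictor


x = [1, 0, 1]

# Unused unit (a_j = 0) after a penalty (r = 0): weights move toward the input.
w_after_penalty = [0.0, 0.0, 0.0]
penalty_prediction_update(w_after_penalty, x, y_j=0, pi_j=0.5, r=0.0,
                          a_j=0.0, rho=1.0, lam=0.004, rho_a=16.0)

# Unused unit after a reward (r = 1): weights move away from the input.
w_after_reward = [0.0, 0.0, 0.0]
penalty_prediction_update(w_after_reward, x, y_j=0, pi_j=0.5, r=1.0,
                          a_j=0.0, rho=1.0, lam=0.004, rho_a=16.0)

# Influential unit (a_j = 1): only the reward-penalty part contributes.
w_influential = [0.0, 0.0, 0.0]
penalty_prediction_update(w_influential, x, y_j=1, pi_j=0.5, r=1.0,
                          a_j=1.0, rho=1.0, lam=0.004, rho_a=16.0)
```

With a large $\rho_a$ the jumps taken by an unused unit are much larger than those of the reward-penalty part, which is the effect examined later in this section.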

In addition to the parameters $\rho$ and $\lambda$ of the $A_{R-P}$ method, this modification
depends on the values of $\rho_a$, $w_a$, and $\tau_a$, which have their strongest effect when $a_j$,
the influence on the output units, is small. The variable $a_j$ is a function of unit $j$’s
output weights, defined in such a way as to scale its value between 0 and 1; $a_j = 1$
when unit $j$ has a very strong influence on an output unit, and $a_j = 0$ when it has no
influence. The scaling function for $a_j$ is controlled by the parameters $w_a$ and $\tau_a$, and
its form is that of the logistic function, where $\tau_a$ is the “spread” of the function and
$w_a$ is the value of its argument such that $a_j = 0.5$ when $|w_{jk}| = w_a$. For example, if
$w_a$ = 1.5 and $\tau_a$ = 0.1 and there is one output unit, then $a_j$ will have the following
values for the given values of unit $j$’s output weight:

output weight ($w_{jk}$)    $a_j$
0                           0.000
±1                          0.007
±2                          0.993
±3                          1.000
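
The tabulated influence values can be checked with a short sketch, assuming the scaling function is a logistic function of the output weight's magnitude, centered at 1.5 with spread 0.1. The function name `influence` is hypothetical.

```python
import math


def influence(w_jk, w_a=1.5, tau_a=0.1):
    """Logistic squashing of the output-weight magnitude into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(abs(w_jk) - w_a) / tau_a))


# Print the influence for the output weights listed in the example.
for w in (0, 1, -1, 2, -2, 3, -3):
    print(f"{w:+d}  {influence(w):.3f}")
```

Because the function depends only on the magnitude of the weight, positive and negative weights of equal size give the same influence.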

For the multiplexer task, the output unit learns under the perceptron learning
rule with a learning constant of $\rho$ = 1. Therefore, the output unit’s weights,
which are the output weights of the hidden units, will always be integer-valued. For
$w_a$ = 1.5 and $\tau_a$ = 0.1, the $A_{R-P}$ with penalty prediction method will approximate
the original $A_{R-P}$ method except when the value of the output weight is 0, as it is
initially, or 1.

Table 10 shows the results of testing this method for various parameter values
over 10 runs of 50,000 steps each. Using the values $\rho$ = 1 and $\lambda$ = 0.004, which gave
the best performance for the original $A_{R-P}$ method, we found that $\rho_a$ = 16, $w_a$ = 1.5,
and $\tau_a$ = 0.1 were the best parameter values tested. Not reported here are further
experiments in which $\rho$ and $\lambda$ are varied, again finding $\rho$ = 1 and $\lambda$ = 0.004 to be
the best of a small set of alternative values.

The best parameters were used to generate the learning curve in Fig. 13, averaged
over 30 runs of 300,000 steps each. The learning curve shows that this method
performed much better than the original $A_{R-P}$. The $A_{R-P}$ with penalty prediction
resulted in performance measures of $\nu$ = 0.08 ± 0.17 and $\mu$ = 7,411 ± 1,773. Roughly
twice as many errors on average were made during the runs of the $A_{R-P}$ method
($\mu$ = 7,411 versus $\mu$ = 15,725).

The fact that large values of $\rho_a$ result in better performance than small values
suggests that the advantage of the penalty prediction modification is due to the size
of the large jumps in a hidden unit’s weights when the unit has a small output
weight, and not to the direction of the weight change. This hypothesis was tested
through further experiments, as follows. The method was modified in a way that
preserved the size of the large weight changes while removing the dependence on the
reinforcement signal to direct the weight change. Instead, a random signal guided
the changes in weight values when the output weight is of low magnitude. Thus,
Step 5b of the $A_{R-P}$ with penalty prediction becomes


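$$\Delta w_{ij}[t] = a_j[t]\,\Bigl(\rho\, r[t]\,\bigl(y_j[t] - \pi_j[t]\bigr)\,x_i[t] + \lambda\rho\,\bigl(1 - r[t]\bigr)\,\bigl(1 - y_j[t] - \pi_j[t]\bigr)\,x_i[t]\Bigr) + \bigl(1 - a_j[t]\bigr)\,\rho_a\,\bigl(\xi_j[t] - \pi_j[t]\bigr)\,x_i[t],$$

where the $\xi_j[t]$ are sequences of Bernoulli random variables (possible values are 0 and 1).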


Table 10: $A_{R-P}$ with Penalty Prediction on the Multiplexer Task

$\rho$ = 1.0, $\lambda$ = 0.004, $\tau_a$ = 0.1

$\rho_a$    $w_a$    $\nu$            $\mu$
0           0.5      7.30 ± 5.06      11,246 ± 1,965
1           0.5      23.50 ± 3.11     20,009 ± 223
2           0.5      19.90 ± 6.85     18,004 ± 2,032
4           0.5      4.70 ± 5.51      10,795 ± 3,324
8           0.5      0.54 ± 1.21      4,682 ± 1,498
16          0.5      3.20 ± 3.06      6,143 ± 2,389
32          0.5      7.10 ± 5.29      7,712 ± 2,734
4           1.0      22.60 ± 3.35     19,786 ± 325
8           1.0      6.20 ± 6.75      10,142 ± 4,136
16          1.0      0.37 ± 0.52      6,779 ± 1,626
32          1.0      3.80 ± 3.85      8,253 ± 1,723
4           1.5      14.70 ± 7.79     17,297 ± 2,781
8           1.5      4.20 ± 5.51      8,301 ± 2,403
16          1.5      0.26 ± 0.52      4,761 ± 2,225
32          1.5      5.20 ± 5.52      8,140 ± 3,714
4           2.0      25.00 ± 3.66     20,209 ± 53
8           2.0      19.40 ± 6.28     19,185 ± 1,477
16          2.0      12.70 ± 6.68     14,733 ± 4,688
32          2.0      4.00 ± 3.84      10,030 ± 3,328
4           4.0      23.80 ± 1.32     20,234 ± 55
8           4.0      22.10 ± 3.19     20,252 ± 66
16          4.0      23.70 ± 3.12     20,254 ± 66
32          4.0      23.00 ± 3.21     20,240 ± 72

$\rho$ = 1.0, $\lambda$ = 0.004, $\rho_a$ = 0.1, $w_a$ = 1.5

$\tau_a$    $\nu$            $\mu$
0.01        2.78 ± 2.66      7,695 ± 2,304
0.1         0.26 ± 0.52      4,761 ± 2,225
0.2         0.48 ± 1.05      8,698 ± 3,030
0.4         18.00 ± 5.89     18,542 ± 2,108
0.6         23.80 ± 2.90     20,242 ± 100
0.8         23.20 ± 2.27     20,340 ± 64
1.0         23.70 ± 2.13     20,243 ± 74

The learning curve for the $A_{R-P}$ with random prediction is shown in Fig. 13.
Its performance is worse than that of the $A_{R-P}$ with penalty-prediction method,
suggesting that there is an advantage in predicting penalties. However, it performs
better than the simple $A_{R-P}$ method. Thus, there is also an advantage to taking
undirected, large steps in the search for weight values for unused units. The increase
in performance of the $A_{R-P}$ with penalty prediction method over the simple $A_{R-P}$ is
probably due to both effects.

Summary  of  Comparative  Simulations 

To facilitate the comparison of the learning methods’ performance on the multiplexer
task, most of the learning curves are superimposed in Fig. 14. Recall that
the errors per time step are plotted by averaging over 30 runs and over bins of 3,000
step intervals. A non-learning, random strategy of selecting outputs would result in
an average of 0.5 errors per time step.
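
The averaging used for the learning curves can be sketched as follows. This is only an illustration of the binning described above; the function name and the representation of a run as a list of per-step error indicators (0 or 1) are assumptions.

```python
def binned_error_curve(runs, bin_size=3000):
    """Average errors per step over all runs within consecutive bins of steps.

    runs is a list of equal-length sequences, one per run, where each entry
    is 1 if the output on that step was wrong and 0 otherwise.
    """
    n_steps = len(runs[0])
    curve = []
    for start in range(0, n_steps, bin_size):
        total = 0
        count = 0
        for run in runs:
            chunk = run[start:start + bin_size]
            total += sum(chunk)
            count += len(chunk)
        curve.append(total / count)
    return curve


# Two tiny runs of four steps each, averaged over bins of two steps.
example_curve = binned_error_curve([[1, 1, 0, 0], [1, 0, 0, 0]], bin_size=2)
```

In the small example the first bin averages three errors over four step samples and the second bin contains no errors.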

It is easily seen that the classes of methods in order of decreasing performance
are

1. error back-propagation (excluding Rosenblatt's method),

2.  reinforcement  learning,  and 

3.  direct  search. 

This ranking is supported by the values of the performance measures, shown in
Table 11, where the methods are ranked according to their resulting values of $\mu$. There
is no statistically significant difference between the values of $\mu$ for the two versions
of the Rumelhart et al. method. However, the difference between these methods
and the best reinforcement-learning method, the $A_{R-P}$ with penalty prediction, is
significant.

Among the reinforcement-learning methods, some differences in $\mu$ are significant,
while others are not. In particular, the results of the AS-RP method are significantly

Figure 14: Learning Curves for All Methods on the Multiplexer Task

Table 11: Performance Summary for Multiplexer Task

method                                 $\nu$           $\mu$              parameters
Rumelhart, sign of output weight       0.00 ± 0.00     1,354 ± 575        $\rho$ = 0.25, $\rho_m$ = 0.9
Rumelhart                              0.00 ± 0.00     1,962 ± 148        $\rho$ = 0.25, $\rho_m$ = 0.9
$A_{R-P}$ with penalty prediction      0.08 ± 0.17     7,411 ± 1,773      $\rho$ = 1, $\lambda$ = 0.004, $\rho_a$ = 16, $w_a$ = 1.5, $\tau_a$ = 0.1
$A_{R-P}$ with random prediction       0.01 ± 0.00     10,695 ± 2,690     $\rho$ = 1, $\lambda$ = 0.004, $\rho_a$ = 16, $w_a$ = 1.5, $\tau_a$ = 0.1
$A_{R-P}$                              0.01 ± 0.01     15,725 ± 3,129     $\rho$ = 1, $\lambda$ = 0.004
$A_{R-P}$ with local reinforcement     0.65 ± 1.08     20,467 ± 6,923     $\rho$ = 0.6, $\lambda$ = 0.0001
AS-RP                                  3.36 ± 1.98     48,754 ± 8,662     $\rho$ = 0.16, $\rho_p$ = 0.01
polytope                               14.2 ± 2.09     94,977 ± 3,079     n = 1600, m = 10, cr = 2, ce = 2, cc = 0.2
guided random                          13.1 ± 2.36     103,866 ± 3,420    n = 3200, r = 1
unguided random                        17.0 ± 2.93     115,062 ± 229      n = 1600
Rosenblatt                             23.9 ± 1.58     121,115 ± 92       $\rho$ = 0.5, $\rho_1$ = 0.9, $\rho_2$ = 0.3, $\rho_3$ = 0.1

worse than all other reinforcement-learning results. As discussed earlier, another
version  of  the  AS-RP  method  should  be  tested:  the  output  of  the  hidden  units 
should  be  included  as  input  to  the  reinforcement  predictor,  thus  not  restricting  the 
reinforcement  prediction  to  be  a  linear  function  of  the  input  as  originally  represented. 
All  other  differences  are  significant.  The  direct  search  methods  are  significantly  worse 
than  others  and  their  relative  ranking  is  also  significant. 

Now  we  can  ask  what  new  features  actually  developed  during  successful  runs  of 
multilayer learning methods, that is, to what sets of input patterns did the hidden
units tune? For a partial answer to this question, we analyzed two runs, one
with Rumelhart et al.’s method and the other with the $A_{R-P}$ with penalty prediction
method. Each run was interrupted at three points to determine the features that the
hidden  units  had  acquired  at  various  stages.  Fig.  14  shows  that  a  single  run  using 
the  Rumelhart  et  al.  method  is  very  likely  to  have  solved  the  multiplexer  task  by 
the  10,000th  step,  so  the  run  was  analyzed  after  2,000,  5,000,  and  10,000  steps.  The 
results  of  this  analysis  appear  in  Table  12.  A  unit’s  state  is  specified  by  a  logical 
expression  for  the  union  of  all  input  vectors  for  which  the  output  of  the  unit  is  1. 
For example, unit 1 on the 2,000th step responds with output 1 for input vectors
$(0,0,0,0,0,1)^T$ and $(0,0,0,1,0,1)^T$ (disregarding the constant component of the input
vectors). Labeling the components of the input vectors as $(a_1, a_2, d_1, d_2, d_3, d_4)$,
for address lines $a_1, a_2$ and data lines $d_1, d_2, d_3, d_4$, a minimal logical expression for
the union of these vectors is $\bar{a}_1 \bar{a}_2 \bar{d}_1 \bar{d}_3 d_4$. Included with each hidden unit expression
is the approximate value of the unit’s output weight, indicating how that unit affects
the activation of the output unit.
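
For the simple case in which the input vectors that turn a unit on form a single product term, the expression can be derived mechanically. The sketch below is an illustration only: the function name, the component names, and the `~` prefix for a negated line are assumptions, and it handles only sets that reduce to one product term rather than performing general logic minimization.

```python
NAMES = ("a1", "a2", "d1", "d2", "d3", "d4")


def product_term(vectors, names=NAMES):
    """Return the single product term covering exactly the given vectors.

    Components on which all vectors agree become literals; components that
    vary are dropped. Returns None when the vectors do not fill the whole
    subcube spanned by the varying components, i.e. when one product term
    is not an exact description of the set.
    """
    literals = []
    free = 0
    for i, name in enumerate(names):
        values = {v[i] for v in vectors}
        if len(values) == 1:
            literals.append(name if values.pop() == 1 else "~" + name)
        else:
            free += 1
    if len(set(vectors)) != 2 ** free:
        return None
    return " ".join(literals)


# The two input vectors from the example differ only in component d2.
term = product_term([(0, 0, 0, 0, 0, 1), (0, 0, 0, 1, 0, 1)])
```

For the two example vectors only the second data line varies, so it drops out of the term while the remaining five components each contribute a literal.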

In addition to the hidden-unit analysis, expressions were determined for the output
unit, both with and without the features generated by the hidden units. Let us
start our discussion of Table 12 with these expressions, by first studying the last row.
At  step  2,000,  a  relatively  complex  expression  developed  for  the  output  unit,  but  by 
step  10,000  the  unit’s  expression  is  exactly  the  multiplexer  expression,  as  expected. 
The  expressions  for  the  output  unit  without  hidden  units  show  that  at  step  10,000 
the  new  features  learned  by  the  hidden  units  are  necessary  for  the  generation  of  the 
correct  output  for  input  vectors  containing  three  of  the  four  possible  addresses;  for 

Table 12: New Features Developed by the Error Back-Propagation Method


Step  2,000 

Step  5,000 

Step  10,000 

Unit  1 

aidididsdt 

did2(d\d3  v  d\d2dA}v 

a\a2(d\d2d$  V  d\d3dA) 

0i02(e?id3  V  i\dA)\/ 
a\a2(i\d2  V  d\dA) 

(-2) 

(-7) 

(-7) 

Unit  2 

Oi02</i 

d]  03(^1 

(-2) 

(-7) 

(-11) 

Unit  3 

null 

d\d2d2d3dA\/ 
d\a2(d2dA  V  d2d3  V  d\d2) 

d]  02(^2 

(-1) 

(-9) 

(-12) 

Unit  4 

null 

d\a2(d\d3dA  V  d2d3dA)\/ 
aid2(did3dA  V  d2d3dA) 

<*\d.2d3 

(-D 

(-3) 

(-7) 

output 

aid2(d3dA  V  djd3 

0102V 

d]d2V 

unit 

wd2dA  V  d\d2dA 

®1 02  V 

di  02  V 

without 

\Zdyd$d.A  V  </ 1 (^2 c?3 )v 

0102V 

0102V 

hidden 

units 

aia2(d\d2dA  V  d\d$dA 
vd\d2d$  V  d2dsdA)v 
(i\d.z{d^d^  V  d2d2 

\/d2dA)v 

a\a.2(d\d2dA  V  d\d$dA 
vd\d2d2  V  d2d2dA) 

a\a2{dA  V  d2d3) 

0102^4 

output 

d\a2{did2dA  V  d\d2dA 

aya2d\V 

Q\Q2d\  V 

unit 

wd2d3)\/ 

d]  a2d2\' 

di02(l2V 

with 

d\a2[d\d2dA  v  d\d3dA 

a  1 02(^3  v  d\d2d3)\ 

O] a2d3 v 

hidden 

units 

vd2d3)  V 
<i\d2(d\d2d 4  V  d\d3dA 

vd2d3)v 
<i\<i2(d\d2dA  V  d\d3dA 
vd]d2d3  V  d2d3dA) 

<i\a2(dA  v  (/2d-i) 

0 1  o2dA 

address (1,1) the output unit itself is capable of producing the correct output.

Given  the  expression  for  the  output  unit  without  hidden  units  at  step  10,000,  it 
is  clear  how  the  terms  formed  by  units  2,  3,  and  4  are  being  used.  All  have  negative 
influences  on  the  output  unit,  effectively  carving  out  of  the  output  unit’s  expression 
those  input  vectors  for  which  the  output  unit  produces  a  1  when  the  correct  output 
is  0.  The  role  played  by  unit  1  is  much  less  clear,  and  would  require  a  careful  analysis 
of  exact  weight  values  for  us  to  understand. 

Table 13 shows the results of a similar analysis of a run with the $A_{R-P}$ with
penalty prediction method. The run was interrupted at 10,000, 20,000, and 50,000
steps, a larger total number of steps than was used for the analysis of the Rumelhart
et al. method. The expression for the output unit with hidden units at the 50,000th
step is indeed the multiplexer expression.

The  manner  in  which  the  hidden  units  interact  with  the  output  unit  to  produce 
the  correct  output  is  not  as  straightforward  as  it  was  for  the  previous  example.  It 
is  clear  that  unit  4  has  acquired  the  same  role  here  as  it  did  in  the  other  run.  This 
is  purely  a  matter  of  coincidence,  since  there  is  no  a  priori  bias  among  the  hidden 
units;  initially,  each  unit  is  equally  likely  to  develop  a  particular  feature. 

In  conclusion,  this  section  presents  the  results  of  several  methods  for  learning  in 
multilayer  networks  on  a  particular  learning  task.  The  task  is  of  sufficient  scale  that 
direct  search  methods  perform  poorly,  never  solving  the  task  within  the  allotted  time. 
Several  error  back-propagation  methods  were  studied,  of  which  Rumelhart  et  al.'s 
method  readily  dealt  with  the  difficulties  of  the  task,  reliably  solving  it  within  sev¬ 
eral  thousand  steps.  Reinforcement-learning  methods  were  also  tested,  with  various 
degrees  of  success. 

In comparing the performance results described in this section, it is important to keep in mind two critical limitations of this study. The most obvious limitation is that a single task was used. The results provide no indication of how the relative ranking of the methods would change if different tasks, either simpler or more complex, were used. Answers to the question of how well the methods scale up to harder tasks require further experiments on tasks of varying complexity. A related issue is how a method's performance is affected by altering the network architecture, such as the addition or removal of hidden units. Neither issue was investigated by this study.

Table 13: New Features Developed by the $A_{R-P}$ with Penalty Prediction Method

The  second  limitation  is  due  to  the  manner  in  which  the  values  for  each  method's 
parameters  were  chosen.  The  experimenter  became  part  of  every  learning  method 
by  trying  a  number  of  different  parameter  values.  The  values  that  resulted  in  the 
best  performance  for  a  particular  method  were  used  in  its  comparison  with  the  other 
methods.  The  time  required  to  perform  this  parameter  optimization  process  is  not 
taken  into  account  by  the  performance  measures  used  in  this  study.  A  method  might 
rank  very  well  according  to  the  performance  measures  but  be  very  sensitive  to  its 
parameter  values  and  require  much  effort  to  find  optimal  parameter  values.  This 
does  not  appear  to  be  the  case  for  the  results  reported  in  this  section.  The  method 
with  the  best  performance  is  Rumelhart’s  error  back-propagation  method  modified 
to  use  the  sign  of  the  output  weight,  and  it  reliably  solves  the  multiplexer  task  for  a 
wide  range  of  parameter  values. 


Some Further Experiments: The Batched $A_{R-P}$ Method

In Section 3, I discussed the relationship between the operation of $A_{R-P}$ networks and the error back-propagation scheme of Rumelhart et al. [44] and mentioned the result of Williams [61] that the weights in an $A_{R-P}$ network (with $\lambda = 0$) move according to an unbiased estimate of the gradient of the global network reward probability. This fact suggests that it might be worthwhile to consider a sampling process in conjunction with the $A_{R-P}$ method. If the units could obtain a better estimate of the true gradient through repeated sampling, then the performance would improve and in the limit approach the performance obtained with deterministic gradient descent methods. Furthermore, the network would retain its simple character, in that all units would still receive the same scalar signal.

To investigate this possibility, we considered a modification of the standard $A_{R-P}$ learning procedure. The standard procedure consists of the following sequence of events, which occurs each time a stimulus is presented: the network determines its output, this output is evaluated, the evaluation is broadcast to all units, and the units change their weights. The modification consists simply in allowing this updating sequence to take place several times during the presentation of a single stimulus. Furthermore, the weight changes induced by these updates are accumulated in a temporary location; only at the end of the stimulus presentation are the accumulated weight changes added to the actual weights. Geometrically, this procedure amounts to obtaining several sample vectors at a given point in weight space, and taking a step which is the resultant of the sample vectors. In the experiment to be reported below, the network computed a batch of ten such sample vectors for each stimulus presentation. We call this procedure the "batched" $A_{R-P}$ method.
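The accumulate-then-apply idea can be illustrated with a minimal sketch. It is not the network used in the experiments: it assumes a single stochastic unit with a logistic output probability and the reward-only form of the rule ($\lambda = 0$), in which each sample contributes $\rho\, r\, (a - p)\, x$. The names `sample_update`, `batched_step`, and `evaluate` are hypothetical and introduced only for illustration.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def sample_update(weights, x, evaluate, rho, rng):
    """One stochastic sample of a reward-only (lambda = 0) update for one unit.

    Assumption: the unit fires with probability p = logistic(w . x) and the
    evaluation r returned by `evaluate` is 0 or 1.
    """
    p = logistic(sum(w * xi for w, xi in zip(weights, x)))  # Pr{a = 1}
    a = 1 if rng.random() < p else 0                        # stochastic output
    r = evaluate(a)                                         # scalar evaluation
    return [rho * r * (a - p) * xi for xi in x]


def batched_step(weights, x, evaluate, rho, batch_size, rng):
    """Accumulate `batch_size` sample updates at fixed weights, then apply the sum."""
    accumulated = [0.0] * len(weights)
    for _ in range(batch_size):
        # The actual weights are not modified inside the batch.
        delta = sample_update(weights, x, evaluate, rho, rng)
        accumulated = [acc + d for acc, d in zip(accumulated, delta)]
    # Only at the end of the stimulus presentation is the resultant applied.
    return [w + acc for w, acc in zip(weights, accumulated)]
```

Calling `batched_step` with `batch_size=1` reduces to the standard one-update-per-presentation procedure, which makes the two procedures easy to compare in a single simulation loop.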

We wished to compare the batched procedure to the standard procedure in terms of the time needed to learn the multiplexer task. Note that the learning time is a function of both the direction and the size of the steps in weight space taken by the network. Since we were interested in the ability of the batched process to improve the direction of these steps, it was important to control for the step size. To do this, we first computed the average step size taken on the first few learning trials by networks using both the standard $A_{R-P}$ learning procedure and the batched procedure (see footnote 3). The ratio of these average step sizes was then used to scale the learning rates. In particular, the learning rate $\rho$ for the standard procedure was chosen to be 0.5, and the learning rate for the batched procedure was then taken to be 0.079, so that the average step size per stimulus presentation was the same in the two cases. Note that in the case of a deterministic method, the learning rate for the batched procedure would have to be 0.05, given that there are 10 samples per presentation; the actual step taken per stimulus presentation would then be the same for the two procedures. For the stochastic method, however, the steps can be in different directions. The fact that the learning rate for the batched procedure was larger than 0.05 indicates that the sample vectors tend to point in different directions and partially cancel, which is of course necessary if sampling is to have any effect.
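The arithmetic behind these learning rates can be written out as a short worked example. The symbols $\rho_{\mathrm{std}}$ and $\rho_{\mathrm{batch}}$ are labels introduced here for the two reported learning rates; the two reference values at the end are general geometric facts about sums of ten equal-length vectors, not measurements from the experiment.

```latex
% Reported learning rates
\rho_{\mathrm{std}} = 0.5, \qquad \rho_{\mathrm{batch}} = 0.079
% Implied ratio of batch resultant length to single-sample step length
% (at equal learning rates)
\frac{\rho_{\mathrm{std}}}{\rho_{\mathrm{batch}}} = \frac{0.5}{0.079} \approx 6.3
% Reference values for a batch of 10 equal-length sample vectors:
%   identical directions (deterministic case):  ratio = 10, so \rho_{\mathrm{batch}} = 0.5 / 10 = 0.05
%   mutually orthogonal directions:             ratio = \sqrt{10} \approx 3.16
```

A ratio of about 6.3, below the deterministic value of 10, is what the text means by the sample vectors partially cancelling.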

The architecture used in this experiment was the same as that used in previous studies of the multiplexer: two layers of weights, with six input lines, four hidden units, and a single output unit (Fig. 10). The hidden units learned using the $A_{R-P}$ rule, while the output unit learned using the perceptron rule. The evaluation signal was a deterministic function of the output of the network: if the output was correct, the evaluation was one, otherwise it was zero.

3. The step size was computed as the Euclidean norm of the vector $\Delta w$.


The  results  are  shown  in  Fig.  15.  The  abscissa  represents  bins  of  200  trials, 
where  a  trial  refers  to  a  single  stimulus  presentation.  As  discussed  above,  with  this 
definition  of  a  trial,  the  two  learning  procedures  are  equated  in  terms  of  average  step 
size  in  weight  space.  The  ordinate  shows  the  average  percentage  error  for  passes 
through  the  64  possible  stimuli.  This  error  is  a  mean  over  the  200  trials  in  the  bins 
on  the  abscissa.  Furthermore,  each  curve  represents  an  average  over  25  replications 
of  the  experiment.  As  can  be  seen,  there  is  a  substantial  improvement  in  using  the 
sampling  procedure  as  compared  to  the  standard  procedure.  If  we  use  a  percentage 
error  of  five  percent  as  a  learning  criterion,  then  the  sampling  procedure  learns  2.8 
times  faster  than  the  standard  procedure. 


Figure 15: Learning Curves Comparing the Standard and Batched $A_{R-P}$ Methods
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This  result  shows  that  it  is  possible  to  obtain  increasingly  accurate  estimates  of 
a  gradient,  without  requiring  a  complex  error  propagation  process.  There  are  both 
practical  and  biological  implications  of  this  result.  Suppose,  for  example,  that  in 
some  learning  domain  it  is  costly  to  obtain  stimulus  items,  but  that  it  is  not  costly 
to  update  the  network  and  obtain  evaluations.  In  such  a  domain,  it  might  be  practical 
to  use  the  sampling  procedure  to  speed  learning.  From  a  biological  point  of  view, 
the  batched  approach  emphasizes  the  point  that  the  agent  evaluating  the  output 
of  a  network  need  only  be  external  to  the  network,  and  not  necessarily  external  to 
the  organism.  If  some  internal  agent  has  sufficient  knowledge  to  be  able  to  evaluate 
actions,  in  particular  if  the  agent  constitutes  a  model  of  the  environment,  then  it  is 
possible  to  improve  learning  through  the  batched  method  without  going  through  the 
environment. 


SECTION  5 


Pole-Balancing  Again 


In previous research we used a version of the pole-balancing task (or inverted-pendulum task) to investigate the capabilities of the learning methods we had developed [13,47,15]. The pole-balancing task is an example of what can be called a strategy-learning task. A difficult temporal credit-assignment problem complicates this kind of learning: there is no standard with which to compare the system's actions on every step. This problem rules out the use of most connectionist learning methods for strategy learning because most of these methods require knowledge of the correct, or desired, actions for a training set of input vectors. Learning proceeds by the presentation of input vectors from the training set and the modification of weights in a manner that is dependent on the error between the correct action and the actual action, as was done for the experiments of Sections 3 and 4. For both the pole-balancing task and the Tower of Hanoi task described in Section 6, the training information arrives in the form of a failure or success signal after a series of actions.

Our earlier work with the pole-balancing task assumed the existence of a representation for the cart-pole system's state consisting of a large number of non-overlapping "boxes" produced by a pre-existing decoder. Given this representation, the task became one of filling in look-up tables: one to specify an evaluation function and one to specify control actions. This simplified representation allowed us to separate representation issues from the issues of temporal credit assignment. In the experiments reported here, the pre-existing decoder is replaced by a layered adaptive network. The Adaptive Critic Element (ACE) and the Associative Search Element (ASE) of previous studies were combined with the error back-propagation method of Rumelhart et al. [44]. Consequently, the architecture used consists of two networks: the evaluation network for learning an evaluation function, which is an elaboration of the ACE, and the action network for learning action heuristics, which is an elaboration of the ASE.

The networks and learning methods are described first. Since the networks and the learning methods used in the pole-balancing task and the Tower of Hanoi puzzle (described in Section 6) are very similar, in this section we describe strategy learning networks in a way that is general enough to encompass the systems used in both of these tasks. We then briefly describe the pole-balancing simulation and how the learning networks interact with it. Finally, results of simulation experiments are described. This section is an edited form of Chapters V and VI of C. W. Anderson's dissertation [4].
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Strategy  Learning  Networks 

As presented, the networks have just two layers, but the learning methods are easily extended to additional layers. The evaluation network and action network do not necessarily have the same number of hidden or output units, but since the particular network being discussed is always obvious from the context, the same variables are used to index units in both networks. Let there be $m_h$ hidden units and $m_o$ output units, for a total of $m = m_h + m_o$ units. The hidden units are indexed from 1 to $m_h$ and the output units are indexed from $m_h + 1$ to $m$. (The evaluation network has only a single output unit.) Let $H$ and $O$ respectively denote the sets of indices of hidden and output units. The evaluation and action networks do not share hidden units; this avoided the difficulties of integrating the hidden-unit [...] of the evaluation network and the action network.

The networks are shown in Fig. 16. The triangles represent the "computational" units [...] "pass through" their [...] represented by intersections of horizontal and vertical lines [...] where an output is computed and sent out the output lines emanating from the apex of the triangles. Input from the environment at time step $t$, denoted by $x_0[t], x_1[t], \ldots, x_n[t]$, is provided to all hidden units and output units. There is an interconnection weight at every intersection: hidden units receive $n + 1$ inputs and have $n + 1$ weights each, whereas output units receive $n + 1 + m_h$ inputs and have $n + 1 + m_h$ weights. For the evaluation network, the weight associated with the $i$th input to unit $j$ at time step $t$ is denoted $v_{i,j}[t]$; the analogous weight of the action network is denoted $w_{i,j}[t]$.

Figure 16: Two-Layer Networks for Strategy Learning

The learning rule for the evaluation network is composed of Sutton's [47] Adaptive Heuristic Critic (AHC) method for the output unit and Rumelhart, Hinton, and Williams's [44] error back-propagation scheme for the hidden units. The AHC rule results in a prediction of future reinforcement for a given state. Changes in this prediction are used as heuristic reinforcement to guide the learning of search heuristics by the action network. The output units of the action network use an associative reinforcement-learning method identical to the one used in our earlier pole-balancing work [13,47]. Other associative reinforcement-learning methods, such as the $A_{R-P}$ rule, could be used for the output unit. The $A_{R-P}$ rule would require an additional mechanism for restricting the heuristic reinforcement to between 0 and 1. We chose to draw on our previous experience with the single-layer network [13] by employing the reinforcement-learning rule used there. This rule is combined with the error back-propagation method for hidden units.

Output  Functions 

Evaluation Network. The output of the evaluation network is computed in the following way. First, the outputs of the hidden units are calculated. The output, $q_j$, of hidden unit $j$ is calculated using the values of its weights, $v$, at time $t_v$, and input, $x$, at time $t_x$, as follows:

$$q_j[t_x, t_v] = f\Big(\sum_{i=0}^{n} v_{i,j}[t_v]\, x_i[t_x]\Big), \quad \text{for } j \in H,$$

where $f$ is the logistic function $f(s) = 1/(1 + e^{-s})$. The multiplication of weight and input vectors from different time steps is required by the learning rules for reasons described below.

The input vector, $y$, for the output unit is composed of the input from the environment and the output of the hidden units:

$$y_i[t_x, t_v] = x_i[t_x], \quad \text{for } i = 0, \ldots, n,$$

$$y_i[t_x, t_v] = q_{i-n}[t_x, t_v], \quad \text{for } i = n + 1, \ldots, n + m_h.$$

The index, $m$, of the single output unit is dropped from $p_m$ for clarity. Thus, the output of the evaluation network is $p$, and is defined as

$$p[t_x, t_v] = \sum_{i=0}^{n + m_h} y_i[t_x, t_v]\, v_{i,m}[t_v].$$
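A minimal sketch of this forward pass follows, assuming a linear output unit and plain Python lists for the weights. The function name `evaluation_output` and the argument layout are hypothetical; the caller is responsible for passing the input vector from one time step and the weights from another when the two-time-index form is needed.

```python
import math


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def evaluation_output(x, v_hidden, v_out):
    """Compute the evaluation network's output for one input vector.

    x        : inputs x_0..x_n (taken at the input time step)
    v_hidden : v_hidden[j] holds the n+1 weights of hidden unit j
    v_out    : the n+1+m_h weights of the single output unit
    Both weight arguments are taken at the weight time step.
    """
    # Hidden-unit outputs: logistic function of the weighted input sum.
    q = [logistic(sum(v * xi for v, xi in zip(vj, x))) for vj in v_hidden]
    # Output unit sees the environment input followed by the hidden outputs.
    y = list(x) + q
    # Linear output unit (an assumption of this sketch).
    p = sum(yi * vi for yi, vi in zip(y, v_out))
    return p, q, y
```

Returning `q` and `y` alongside `p` is a convenience: the learning rules need the hidden outputs and the output unit's input vector from the preceding step.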

Action Network. To define the output of the action network we first define the hidden unit outputs, $a_j$:

$$a_j[t] = f\Big(\sum_{i=0}^{n} w_{i,j}[t]\, x_i[t]\Big), \quad \text{for } j \in H.$$

These values partly determine the input vector, $z$, for the output units, along with the input from the environment:

$$z_i[t] = x_i[t], \quad \text{for } i = 0, \ldots, n,$$

$$z_i[t] = a_{i-n}[t], \quad \text{for } i = n + 1, \ldots, n + m_h.$$

For the experiments in later chapters, the output components, $a_j$, $j \in O$, of the action network are in one-to-one correspondence with the possible actions defined for the task (see footnote 1). To select an action for a given problem state, the output of one output unit is set to 1 and the outputs of other units are set to 0 by the following process. The output functions of the reinforcement-learning units are stochastic, i.e., their output depends on a noisy weighted sum of inputs. A competition among the output units is implemented by assigning the value 1 to the unit with the highest weighted sum plus noise. This competition is limited to units corresponding to legal actions for the current state. Let $L_t \subset O$ be the set of indices for the output units that represent legal actions for the state at time $t$. The determination of $L_t$ at each time step can be implemented by a network and even learned through experience, though for our experiments we specified $L_t$ a priori. The responses of the output units are calculated as follows.

Let $s_j$ be the noisy weighted sum of the input for unit $j$, $j \in L_t$, defined as

$$s_j[t] = \sum_{i=0}^{n + m_h} z_i[t]\, w_{i,j}[t] + \eta_j[t],$$

where $\eta_j[t]$ is a random variable with distribution function $\Psi$ (for the pole-balancing task, $\Psi$ is the logistic distribution). The unit with the largest value for $s_j$ wins the competition and is assigned a nonzero output:

$$a_j[t] = \begin{cases} 1, & \text{if } s_j[t] > s_k[t] \text{ for } k \in L_t \text{ and } j \neq k; \\ 0, & \text{otherwise.} \end{cases}$$

1. Rather than representing actions in this localized way, each action can be encoded by a pattern of output-unit activity. For example, the six possible actions for the Tower of Hanoi puzzle could be represented as patterns of output values over three output units. This can lead to generalization among actions represented by similar output patterns, which can either benefit or hinder the learning of correct actions. These issues are not addressed by the representation described in the text, although the reinforcement-learning method is capable of dealing with the credit-assignment problem that results when output patterns encode actions.

To simplify the determination of the unit with the largest $s_j$ for the Tower of Hanoi task, the following exponential probability distribution is used for $\Psi$:

$$\Psi(q) = 1 - e^{-q}.$$

The output function can be simplified for tasks with only two possible actions for every state, such as the pole-balancing task. A single output unit is used whose binary output values encode the two actions. Let this unit be unit $k$, i.e., $O = \{k\}$. The specialization of the output function for this case is:

$$a_k[t] = \begin{cases} 1, & \text{if } s_k[t] > 0; \\ 0, & \text{otherwise.} \end{cases}$$
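The competition among legal output units can be sketched as follows. This is an illustrative reading of the description above, not the original simulation code: `select_action`, `logistic_noise`, and the argument names are hypothetical, and the noise source is passed in as a function so that either the logistic or the exponential distribution could be supplied.

```python
import math
import random


def logistic_noise(rng):
    """Sample from the standard logistic distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() lies in [0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))


def select_action(z, w, legal, noise):
    """Return the 0/1 output vector of the action network's output units.

    z     : input vector to the output units (environment input + hidden outputs)
    w     : w[j] holds the weights of output unit j
    legal : indices of output units representing legal actions
    noise : zero-argument function returning one noise sample per call
    """
    # Noisy weighted sums are computed only for units with legal actions.
    s = {j: sum(zi * wi for zi, wi in zip(z, w[j])) + noise() for j in legal}
    winner = max(s, key=s.get)
    # The winner outputs 1; every other unit, legal or not, outputs 0.
    return [1 if j == winner else 0 for j in range(len(w))]
```

For example, `select_action(z, w, legal, lambda: logistic_noise(random.Random()))` draws fresh logistic noise for each legal unit, while passing `lambda: 0.0` makes the choice deterministic, which is convenient for checking the competition logic.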

Learning  Rules 

Output Layer of Evaluation Network. The change in $p$ plus the value of the external reinforcement $r$ is called the heuristic reinforcement, or $\hat{r}$:

$$\hat{r}[t] = r[t] + \gamma\, p[t, t-1] - p[t-1, t-1],$$

where $0 < \gamma < 1$ is called the discount rate. The use of $\hat{r}$ in updating the weights of the evaluation network's output unit results in a prediction of future discounted reinforcement for the current state, with reinforcement farther in the future discounted more than earlier reinforcement [47,46]. States for which $p$ is relatively large are favorable, while those with relatively low $p$ are to be avoided. Once this mapping is correctly formed, changes in $p$ can be used to indicate whether recent actions are leading toward favorable or unfavorable states.

The double time dependencies of variables in the equations for the evaluation network are needed for the following reason. In comparing one value of $p$ with a previous value, care must be taken to avoid instability in the growth of weight values (equations for changing weight values are presented shortly). If the computation of $p$ for step $t - 1$ uses $v[t-1]$ whereas $p$ for step $t$ uses $v[t]$, then a change in $p$ from one time step to the next could be caused by a change in weight values rather than the encounter of a state with a different expectation of reinforcement. To avoid this, the pair of subsequent $p$'s is based on a single set of weight values, i.e., the difference between $p$ for step $t - 1$ and for step $t$ is due only to the change from $x_i[t-1]$ to $x_i[t]$, because both $p$'s are calculated using $v[t-1]$. If weights are known to change by small magnitudes on each step, then this precaution may not be necessary (as done in Ref. [13]).

Sutton [47] specialized the AHC rule by redefining $\hat{r}$ for several classes of tasks involving distinct trials, where a trial consists of the following steps:

1. setting the state of the problem to a start state,

2. letting the learning system and environment interact, until

3. a goal state or failure state is encountered, signaled by a particular external reinforcement value.

Following Sutton, $\hat{r}$ for trial-based tasks, such as those considered here, is defined to be:

$$\hat{r}[t] = \begin{cases} 0, & \text{if state at time } t \text{ is a start state;} \\ r[t] - p[t-1, t-1], & \text{if state at time } t \text{ is a goal or failure state;} \\ r[t] + \gamma\, p[t, t-1] - p[t-1, t-1], & \text{otherwise.} \end{cases} \qquad (5.1)$$

The weights of the output unit, unit $m$, of the evaluation network are updated by the following equation:

$$v_{i,m}[t] = v_{i,m}[t-1] + \beta\, \hat{r}[t]\, y_i[t-1, t-1],$$

for $i = 0, \ldots, n + m_h$ and $\beta > 0$. A positive change in state evaluations, indicated by a positive $\hat{r}$, results in an increase (decrease) in weight values proportional to the corresponding positive (negative) input values on the preceding steps. In this way, the evaluation of the preceding state is altered, effectively shifting evaluations to earlier states.
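The trial-based heuristic reinforcement of Eq. (5.1) and the output-unit update that uses it can be sketched together. The function names and the string labels for the kind of state (`"start"`, `"goal"`, `"failure"`, anything else) are hypothetical conveniences; `p_now` stands for the evaluation of the current input computed with the previous weights, and `p_prev` for the evaluation of the previous input computed with those same weights.

```python
def heuristic_reinforcement(r, p_now, p_prev, gamma, state_kind):
    """Heuristic reinforcement for trial-based tasks, following Eq. (5.1)."""
    if state_kind == "start":
        return 0.0
    if state_kind in ("goal", "failure"):
        # No successor evaluation is used at the end of a trial.
        return r - p_prev
    return r + gamma * p_now - p_prev


def update_output_weights(v_out, y_prev, r_hat, beta):
    """Evaluation-network output unit: move each weight by beta * r_hat * y_i."""
    # y_prev is the output unit's input vector from the preceding step.
    return [v + beta * r_hat * y for v, y in zip(v_out, y_prev)]
```

Keeping the two steps separate mirrors the text: the same value of the heuristic reinforcement is also broadcast to the action network, so it is computed once and reused.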

The above expression is a simplification of Sutton's learning rule: in its general
form, $y_i[t-1, t-1]$ is a trace of previous values of $y_i$, called an eligibility trace.
An example of an eligibility trace is a weighted average of past values of $y_i$ with
recent values weighted more heavily. This generally results in faster development
of  good  evaluation  functions.  Eligibility  traces  can  also  be  used  in  the  weight  up¬ 
date  equations  of  the  action  network.  We  chose  not  to  implement  eligibility  traces 
primarily  for  the  following  reason.  Preliminary  experiments  with  the  pole-balancing 
task  showed  that  a  one-layer  action  network  functioning  with  eligibility  traces  and 
without  an  adaptive  evaluation  network,  i.e.,  learning  only  from  the  external  rein¬ 
forcement,  could  learn  to  perform  relatively  well.  However,  our  interests  were  in 
studying  learning  in  hidden  units,  which  are  required  for  the  development  of  a  good 
evaluation  function  for  the  pole-balancing  task  as  it  is  formulated  here.  We  removed 
the  eligibility  traces  from  both  networks  to  force  a  greater  reliance  on  the  evaluation 
function  and  to  increase  the  number  of  failures  early  in  a  run,  providing  more  exter¬ 
nal  reinforcement  and  thus  more  opportunities  to  improve  the  evaluation  function. 
Thus,  our  primary  goal  was  not  to  achieve  the  fastest  possible  learning  on  this  task 
but  to  investigate  learning  in  hidden  units. 

Output Layer of Action Network. Output unit $j$, $j \in O$, of the action network updates its weights according to:

$$w_{i,j}[t] = w_{i,j}[t-1] + \rho\, \hat{r}[t]\, \big(a_j[t-1] - E\{a_j[t-1] \mid w; z\}\big)\, z_i[t-1],$$

for $i = 0, \ldots, n + m_h$, where $E\{a_j[t-1] \mid w; z\}$ is the expected value of $a_j[t-1]$ conditional on the current values of $w$ and $z$. Weight values are not changed for output units corresponding to illegal actions. The value of $a_j[t-1] - E\{a_j[t-1] \mid w; z\}$ can be viewed as a measure of the difference between action $a_j[t-1]$ and the action that is usually taken for the given values of $z[t-1]$ and $w_j[t-1]$. Thus, the results of an unusual action have more of an impact on the adjustment of weights than do other actions. Since $a_j \in \{0, 1\}$, the expected value of $a_j$ is equal to the probability that $a_j$ is 1, i.e.,

$$E\{a_j[t] \mid w; z\} = \Pr\{a_j[t] = 1\}.$$

See Ref. [4] for derivations of this probability for the cases of two and three actions, the only cases that arise for the formulation of the Tower of Hanoi puzzle used in Section 6. The calculation of this probability is easy for the pole-balancing task because there are just two possible actions. In this case, $\Pr\{a[t] = 1\}$ is just $\Psi(q)$, where $q$ is the weighted sum of the unit's input.
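For the two-action case, the update can be sketched in a few lines. This assumes logistic noise, so that the probability of emitting a 1 is the logistic function of the weighted sum; `update_action_output` and its argument names are hypothetical.

```python
import math


def logistic(q):
    return 1.0 / (1.0 + math.exp(-q))


def update_action_output(w, z_prev, a_prev, r_hat, rho):
    """Two-action case: reinforce the previous action relative to its expectation.

    w      : weights of the single output unit
    z_prev : the unit's input vector on the preceding step
    a_prev : the action (0 or 1) taken on the preceding step
    """
    q = sum(wi * zi for wi, zi in zip(w, z_prev))   # weighted sum of the input
    expected = logistic(q)                          # Pr{a = 1} under logistic noise
    # Unusual actions (far from the expectation) change the weights the most.
    return [wi + rho * r_hat * (a_prev - expected) * zi for wi, zi in zip(w, z_prev)]
```

With a positive heuristic reinforcement the weights move so that the action just taken becomes more probable for similar inputs; with a negative one they move the other way.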

Hidden Layer of Evaluation Network. From the results of the comparative experiments described in Section 4, we concluded that the error back-propagation method of Rumelhart, Hinton, and Williams usually learned most rapidly (for the particular multiplexer task used in the experiments). However, this method cannot be applied directly because it requires knowledge of the correct output. Here we do not know the correct action or the correct evaluation for a given state, which would be needed in order to calculate an error to be back-propagated.

To apply an error back-propagation scheme to the hidden units of a network whose output layer is learning through reinforcements, a way of translating a reinforcement into an error must be found. This can be done in a heuristic manner by extracting from the reinforcement-learning equations the terms that govern weight updates in a fashion similar to the error terms in the gradient-descent rules. However, it is not obvious how to incorporate the eligibility traces often used in reinforcement-learning methods into a back-propagation scheme (this is another reason for not including traces for the experiments reported here).

For the evaluation network, $\hat{r}$ plays the role of an error in the update of the output unit's weights. Therefore, we define the error of the output unit, $\delta^p_m$, to be:

$$\delta^p_m[t-1] = \hat{r}[t],$$

where the superscript denotes the association with the evaluation network that generates output $p$. The error that is back-propagated from the output unit to hidden unit $j$ is just $\hat{r}$, and Rumelhart et al.'s [44] expression with Sutton's [48] modification for the error of hidden unit $j$, called $\delta^p_j$, becomes:

$$\delta^p_j[t-1] = \hat{r}[t]\, \mathrm{sgn}\big(v_{n+j,m}[t-1]\big)\, q_j[t-1, t-1]\, \big(1 - q_j[t-1, t-1]\big),$$

and their method for updating the hidden units' weights can be applied:

$$v_{i,j}[t] = v_{i,j}[t-1] + \beta_h\, \delta^p_j[t-1]\, x_i[t-1] + \beta_m\, \Delta v_{i,j}[t-1],$$

for units $j \in H$ and inputs $i = 0, \ldots, n$. Note that the sign of hidden unit $j$'s output weight rather than the weight value itself is used. This variation is used because the results of the comparative study reported in Section 4 suggest that the method's sensitivity to the value of the learning rate parameter, here $\beta_h$, is decreased by the use of the sign of the weight.
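A sketch of this hidden-unit update for one hidden unit of the evaluation network is given below. It assumes the usual reading of the momentum term, namely that the momentum factor multiplies the weight change made on the previous step; the function and argument names are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def update_eval_hidden(v_j, v_out_j, x_prev, q_prev_j, r_hat,
                       beta_h, beta_m, prev_change):
    """Update the weights of one hidden unit of the evaluation network.

    v_j         : the hidden unit's n+1 input weights
    v_out_j     : weight from this hidden unit to the output unit
    x_prev      : environment input on the preceding step
    q_prev_j    : the hidden unit's output on the preceding step
    prev_change : weight changes applied on the previous update (momentum)
    """
    # Only the sign of the output weight is used, not its magnitude.
    delta_j = r_hat * sign(v_out_j) * q_prev_j * (1.0 - q_prev_j)
    change = [beta_h * delta_j * xi + beta_m * dv
              for xi, dv in zip(x_prev, prev_change)]
    new_v_j = [v + c for v, c in zip(v_j, change)]
    return new_v_j, change
```

The returned `change` list is meant to be passed back in as `prev_change` on the next call.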

Hidden Layer of Action Network. The equation for updating the weights of the action network's hidden units is a bit more complicated. Once $\hat{r}$ becomes a good evaluation of the previous action, the role of an error is played by the product of $\hat{r}$ and the difference between the previous action and its expected value. The sign of the product is an indication of whether the action probability should be increased or decreased. So the error in the output of output unit $k$, $k \in L_t$, of the action network is defined as:

$$\delta^a_k[t-1] = \hat{r}[t]\, \big(a_k[t-1] - E\{a_k[t-1] \mid w; z\}\big).$$

The back-propagated error to hidden unit $j$ is used to compute the hidden unit's error:

$$\delta^a_j[t-1] = \Big(\sum_{k \in L_t} \delta^a_k[t-1]\, \mathrm{sgn}\big(w_{n+j,k}[t-1]\big)\Big)\, a_j[t-1]\, \big(1 - a_j[t-1]\big),$$

and the weights are updated by the following equation:

$$w_{i,j}[t] = w_{i,j}[t-1] + \rho_h\, \delta^a_j[t-1]\, x_i[t-1] + \rho_m\, \Delta w_{i,j}[t-1],$$

for units $j \in H$ and inputs $i = 0, \ldots, n$. Disregarding the different errors that are back-propagated by the two networks, the learning rules used by the hidden units of the two networks are identical. The sum over the products of output unit errors and weights is not included in the expression for a hidden unit's error in the evaluation network because there is only one output unit.
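The hidden-unit error for the action network differs from the evaluation-network case only in the sum over output units, which the following sketch makes explicit. The names are hypothetical, and the lists are assumed to cover only the output units that took part in the competition.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def action_hidden_error(output_errors, weights_from_j, a_prev_j):
    """Error of one hidden unit of the action network.

    output_errors  : error of each participating output unit on the preceding step
    weights_from_j : weight from this hidden unit to each of those output units
    a_prev_j       : the hidden unit's output on the preceding step
    """
    # Back-propagate through the sign of each output weight, then scale by
    # the derivative of the logistic function at the hidden unit's output.
    back = sum(d * sign(w) for d, w in zip(output_errors, weights_from_j))
    return back * a_prev_j * (1.0 - a_prev_j)
```

With a single output unit the sum has one term, and the expression has the same form as the evaluation-network hidden error.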

Parameters. The equations for the evaluation network are governed by the following parameters:

$\beta$ = learning rate for the output unit ($\beta > 0$);

$\beta_h$ = learning rate for the hidden units ($\beta_h > 0$);

$\beta_m$ = momentum factor for the hidden units ($\beta_m > 0$);

$\gamma$ = discount rate ($0 < \gamma < 1$).

Similar parameters appear in the equations for the action network:

$\rho$ = learning rate for the output units ($\rho > 0$);

$\rho_h$ = learning rate for the hidden units ($\rho_h > 0$);

$\rho_m$ = momentum factor for the hidden units ($\rho_m > 0$).

In  applying  this  system  to  a  task,  it  is  important  to  test  a  number  of  values  for 
each  parameter  to  investigate  the  sensitivity  of  the  methods  with  respect  to  the 
parameters.  This  was  done  for  all  experiments  described  in  this  report. 


The  Pole-Balancing  Task 

In  this  section  we  describe  the  pole-balancing  task,  our  computer  simulation  of 
it,  and  how  this  simulated  system  interacts  with  the  adaptive  networks.  Additional 
details can be found in Ref. [13]. Learning to solve the version of the pole-balancing,
or  inverted-pendulum,  task  that  we  have  studied  is  nontrivial  for  two  reasons: 

1.  the evaluation function to be learned is nonlinear and therefore cannot be
formed  by  a  single  linear  unit,  and 

2.  a  performance  evaluation  in  the  form  of  a  failure  signal  appears  only  after  a 
sequence  of  actions  has  been  taken,  making  it  difficult  to  identify  which  actions 
are  good  and  which  are  bad. 

The pole-balancing task involves a pole hinged to the top of a wheeled cart that
travels along a track (as described in Ref. [13]). Both pole and cart are constrained
to move in a plane. The state at time $t$ of this dynamical system is specified by four
real-valued variables:

$x_t$ = the horizontal position of the cart, relative to the track;

$\dot{x}_t$ = the horizontal velocity of the cart;

$\theta_t$ = the angle between the pole and vertical, clockwise being positive;

$\dot{\theta}_t$ = the angular velocity of the pole.

The goal is to exert a sequence of forces, $F_t$, upon the cart's center of mass such
that the pole is balanced for as long as possible and the cart does not hit the end
of the track. More abstractly, the state of the cart-pole system must be kept out of
certain regions of the state space, making this an avoidance control problem. There
is no unique solution: any trajectory through the state space that does not pass
through the regions to be avoided is acceptable. A minimal amount of knowledge
about the task is assumed in our experiments. The only information regarding the
goal of the task is provided by the external reinforcement signal, $r_t$, that signals the
occurrence of a failure caused either by the pole falling past a prespecified angle, or
the cart hitting the bounds of the track. $r_t$ is defined as

$$r_t = \begin{cases} 0, & \text{if } -0.21 \text{ radians} < \theta_t < 0.21 \text{ radians and } -2.4 \text{ m} < x_t < 2.4 \text{ m}; \\ -1, & \text{otherwise.} \end{cases}$$

Note that $r_t$ does not depend on $\dot{\theta}_t$ or $\dot{x}_t$.
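
The failure signal can be written as a short function. This is a minimal sketch: the function and parameter names are hypothetical, and the failure value of -1 follows the Results section, which counts occurrences of r[t] = -1.

```python
def failure_signal(x, theta, theta_limit=0.21, x_limit=2.4):
    """Return 0 while the pole angle (radians) and cart position (m) are
    strictly inside their bounds, and -1 once either bound is reached."""
    in_bounds = (-theta_limit < theta < theta_limit) and (-x_limit < x < x_limit)
    return 0 if in_bounds else -1
```

Only the cart position and pole angle are arguments, reflecting the note that the signal does not depend on either velocity.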

We  simulated  the  cart-pole  system  using  a  numerical  approximation  of  the  system 
of  nonlinear  differential  equations  that  models  the  system.  Details  are  provided  in 
Ref. [13].

From successful experiments with two-layer systems, we found that good evaluation
and action functions look like those sketched in Fig. 17 (see footnote 2). For clarity,
let us limit attention to projections of the functions to the $(\theta, \dot{\theta})$ subspace.
Figure 17a shows the kind of function one might expect the evaluation network to
learn. Failure is likely to occur in the upper right and lower left portions of the
$(\theta, \dot{\theta})$ state space. We want the learning system to shift this failure signal to
states that precede failure states, then to states that precede failure by longer time
intervals, with the strength of the shifted prediction indicative of the average number
of time steps until failure. Past states and actions, or weighted averages of previous
states and actions, could be used to apportion blame for the failure to previous
actions, but the tradeoff between (a) the need for a long history to blame actions many
steps in the past, and (b) the need for a short history to avoid blame being spread too
thinly (resulting in slow learning), is difficult to optimize. These temporal
credit-assignment issues are studied in detail by Sutton [47,46], who developed the AHC
learning rule used here in the output unit of the evaluation network.

Figure 17: Good Functions for Pole-Balancing Solution

Footnote 2: Due to the nature of our formulation of the learning task as an avoidance
control problem, one cannot say that these functions, or any others, are the unique
optimal functions. There are very many ways that the system can avoid the failure
regions of state space.

The map from $(x, \dot{x})$ to a prediction of failure also looks like Fig. 17a. In the lower
left corner, the cart is moving to the left and is near the left border of the track, and
in the upper right corner it is approaching the right border of the track.

Figure 17b shows an action function for generating a push on the cart. For
small angles, such as $-0.21 < \theta < 0.21$ as used in our experiments, the surface
that separates states requiring different actions, called the switching surface, can be
linear. States in the upper right region require a push to the right, while states in
the lower left require a left push. The linear switching surface is an approximation to
the nonlinear switching surface of the time-optimal bang-bang controller. The linear
approximation works well for the small range of angles used in the experiments. The
position and slope of the linear surface vary for different values of $x$ and $\dot{x}$.

The pole-balancing task frequently has been used to illustrate standard control
techniques due to the inherent instability of the pole and the task's similarity to
many balance-control problems. For example, Cannon [17] shows how the root-locus
method is used to analyze the stability of a "lead compensation-network" controller
that exerts a force proportional to the derivative of an error signal, in this case
the pole's angular velocity. His analysis is confined to small angles and angular
velocities and does not include the goal of avoiding the bounds of the track. These
control-design techniques require a model of the system to be controlled in the form
of differential equations that define how the state variables change over time. A
good deal of time must be spent by the control engineer in determining a model
that approximates the behavior of the system to the desired degree of accuracy.
Control systems that learn without a predefined model, or that acquire internal
models through observation of the system's behavior, would obviate this potentially
difficult analysis.

An alternative to expressing a control law as an analytical equation is to represent
the function in tabular form. Michie and Chambers [35] took this approach for
their learning system as applied to the pole-balancing task. Their table consisted
of approximately 162 "boxes": nonoverlapping, rectangular regions of the cart-pole
system's state space, each containing an average count of the number of steps before
failure for a push to the right when the system's state addresses the corresponding
box, and an analogous count for a push to the left. When a box is entered, the push
with the highest count is applied. Their system successfully improved its performance
with experience.
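
The table look-up idea can be sketched as follows. The threshold values and the 3 x 3 x 6 x 3 partition used here are assumptions chosen only so that the table has 162 entries; the text above does not give the actual region boundaries, and all names are hypothetical.

```python
import bisect

# Hypothetical region boundaries (assumed for illustration): 3 * 3 * 6 * 3 = 162 boxes.
X_EDGES = [-0.8, 0.8]
X_DOT_EDGES = [-0.5, 0.5]
THETA_EDGES = [-0.105, -0.0175, 0.0, 0.0175, 0.105]
THETA_DOT_EDGES = [-0.87, 0.87]


def box_index(x, x_dot, theta, theta_dot):
    """Map a cart-pole state to the index (0..161) of the box containing it."""
    ix = bisect.bisect_right(X_EDGES, x)
    iv = bisect.bisect_right(X_DOT_EDGES, x_dot)
    it = bisect.bisect_right(THETA_EDGES, theta)
    iw = bisect.bisect_right(THETA_DOT_EDGES, theta_dot)
    return ((ix * 3 + iv) * 6 + it) * 3 + iw


def choose_push(counts, box):
    """Pick the push with the larger steps-before-failure count for this box.

    counts[box] is a (left_count, right_count) pair; ties go to the left push,
    which is an arbitrary choice made for this sketch.
    """
    left, right = counts[box]
    return "right" if right > left else "left"
```

Because each box keeps its own pair of counts, nothing learned in one box carries over to another, which is the generalization issue discussed later in this section.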

Our previous learning system for this task [13] integrated the table look-up
approach with connectionist learning methods. Separate tables were used to store
predictions of reinforcement and probabilities of generating actions, each indexed by
the state of the cart-pole system. The tables were implemented as two units, each
receiving 162 binary-valued input components. When the state was in a particular
box, the corresponding input component was set to 1 and all other components were
set to 0. Therefore, the weighted sum of the unit's input was equal to the value of
the weight associated with the nonzero input component.
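
A short sketch shows why a unit with this binary input coding acts as a table. The helper names are hypothetical.

```python
def one_hot(box, n_boxes=162):
    """Binary input vector with a 1 in the position of the active box."""
    v = [0] * n_boxes
    v[box] = 1
    return v


def weighted_sum(weights, inputs):
    """Weighted sum computed by a single unit."""
    return sum(w * x for w, x in zip(weights, inputs))
```

With this coding, `weighted_sum(weights, one_hot(b))` returns `weights[b]`, so each weight is effectively one table entry.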


An obvious problem with the table look-up approach is that the size, shape, and
placement of the regions into which the state space is divided greatly influence the
ability of the system to learn the desired mappings. A region might be too large,
meaning that different states inside the region require different output values, e.g.,
different pushes on the cart. Conversely, regions are smaller than optimal when
many regions require the same output. If these regions are instead subsumed by one
large region, then what is learned for one state is generalized correctly to all other
states in the region. With many small regions, learning must occur in all regions
independently. Michie and Chambers proposed a solution to this problem: regions
for which one output is not clearly better than any other despite repeated experience
should be "split" into several, smaller regions, and regions with the same output
value should be "lumped" into a single, larger region. Politis and Licata have
pursued this possibility with Barto et al.'s [13] learning system and a technique for
periodically splitting every region uniformly into a number of smaller regions.

In the experiments described here, the fixed "decoder" of the system described
in Ref. [13] to translate the cart-pole state to a region address is replaced by a
layer of hidden units that learns features useful in solving the pole-balancing task.
This "adaptive decoder" view is closely related to current research topics in control
theory involving the application of multiple controllers to one task. For example, the
control of a full 360-degree pole requires a complex control law in order to be useful
for all states. An alternative is to use a collection of less-complex control laws, and
activate one at a time, based on the current state and an ordering of the control laws
according to their ability to deal with that state. Learning when to switch from one
controller to another is analogous to learning how to classify the state into one of a
set of boxes.

Another example of a connectionist-style system applied to the pole-balancing
task comes from the work of Widrow and Smith. They present results of using
a supervised-learning scheme to train a network of Adaline units to duplicate the
responses of a teacher. For their experiments the teacher was a predefined linear
control law. A human could play the role of the teacher by manually controlling the
pole through an interface, such as a joystick, and the Adaline network could use the
human's responses as training examples. Learning to mimic a teacher is much easier
than learning from delayed reinforcement as is required for our formulation of the
pole-balancing task, but the work of Widrow and Smith has been very useful to us
in showing how adaptive units can be applied to control problems.


Interaction  between  the  Network  and  Cart-Pole  Simulation 

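The components of the networks' input, $x_i[t]$, are scaled versions of the state
variables:

$$x_1[t] = \tfrac{1}{4.8}\,(x[t] + 2.4),$$

$$x_2[t] = \tfrac{1}{3}\,(\dot{x}[t] + 1.5),$$

$$x_3[t] = \tfrac{1}{0.42}\,(\theta[t] + 0.21),$$

$$x_4[t] = \tfrac{1}{4}\,(\dot{\theta}[t] + 2).$$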

An additional input, $x_0[t]$, with a constant value of 0.5 provides a variable threshold.
Inputs $x_1[t]$ and $x_3[t]$ range from 0 to 1, while $x_2[t]$ and $x_4[t]$ are primarily within the
0 to 1 range, but can fall outside these bounds. This scaling accomplishes two things.
First, since the learning rules involve the input terms, $x_i[t]$, as factors in the equations
for updating weight values, terms with predominantly larger magnitudes will have
a greater influence on learning than will other terms. To remove this bias all input
terms are scaled to lie within the same range. Secondly, since the values of the state
variables are centered at zero, if these values were used directly as network inputs, the
correct action for positive $\theta$ and $\dot{\theta}$ would transfer to negative $\theta$ and $\dot{\theta}$ in just the
right way, i.e., the correct action for negative $\theta$ and $\dot{\theta}$ is the negative of the
action for positive $\theta$ and $\dot{\theta}$ (see Figure 17b). [...] make the task much easier,
circumventing [...] network. Thus, scaling the [...] hidden units in the case of the
action [...]

The force exerted on the cart's center of mass at time $t$ is given by:

$$F[t] = \begin{cases} 10 \text{ N}, & \text{if } a[t] = 1; \\ -10 \text{ N}, & \text{if } a[t] = 0, \end{cases}$$

where $a[t]$ is the binary-valued output of the action network at time $t$. The sampling
rate of the cart-pole system's state and the rate at which control forces are applied
are the same as the basic simulation rate, i.e., 50 Hz.
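
The interface between the networks and the simulation can be sketched as two small functions. This is an illustration only: the function names are hypothetical, and the scale factors 1/4.8, 1/3, 1/0.42, and 1/4 are assumed from the stated offsets and from the requirement that the position and angle inputs span 0 to 1 over the non-failure range.

```python
def scale_inputs(x, x_dot, theta, theta_dot):
    """Return the network input vector [x0, x1, x2, x3, x4].

    x0 is the constant threshold input; x1 and x3 lie in [0, 1] whenever the
    state is inside the failure bounds; x2 and x4 are only approximately in
    that range because the velocities are not bounded.
    """
    return [
        0.5,
        (x + 2.4) / 4.8,
        (x_dot + 1.5) / 3.0,
        (theta + 0.21) / 0.42,
        (theta_dot + 2.0) / 4.0,
    ]


def force_from_action(a, magnitude=10.0):
    """Map the binary action-network output to a force on the cart (newtons)."""
    if a not in (0, 1):
        raise ValueError("action must be 0 or 1")
    return magnitude if a == 1 else -magnitude
```

At the balanced state (all four state variables zero) every component of the scaled input equals 0.5, so none of the inputs is centered at zero.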


Results 


The experiments consisted of a number of trials, each starting with the cart-pole
system set to a state chosen at random, and ending with the appearance of the failure
signal. A series of trials constitutes a run, with the first trial of a run starting with
weights initialized to random values between -0.1 and 0.1. We want the learning
system to learn to generate actions, $F[t]$, that maximize the number of time steps
between occurrences of $r[t] = -1$. The only information available to the system is
given by the sequences $x_i[t]$, $i = 0, \ldots, 4$, and $r[t]$.


One-Layer  Experiments 


We experimented with one-layer networks (no hidden units) to obtain performance
measures with which the performance of the two-layer system could be compared.
The learning rules for the one-layer networks depend on the three parameters,
$\beta$, $\rho$, and $\gamma$. The value of $\gamma$ was fixed at 0.9, while different values of $\rho$ and $\beta$ were
crudely optimized (simulation time prevented an accurate optimization) by performing
2 runs of 500,000 steps each for approximately 25 different sets of parameter
values. Two performance measures were used to select the best parameters. The
number of trials, averaged over runs with one set of parameter values, provides a
rough measure of performance over the length of a run. To judge how well the
solution had been learned by the end of the run, the number of steps in the last trial,
or the previous trial, whichever is larger, is averaged over all runs. In this way, an
abnormally short final trial caused by the termination of a run on the 500,000th step
does not enter into the average of final trial lengths.
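
The two performance measures can be computed from per-trial step counts as sketched below. The function names and the data layout (one list of trial lengths per run) are assumptions made for illustration.

```python
def run_measures(trial_lengths):
    """Return (number of trials, final trial length) for one run.

    The final trial length is the larger of the last and the previous trial,
    so a trial cut short by the end of the run is not counted.
    """
    n_trials = len(trial_lengths)
    final_length = max(trial_lengths[-2:])
    return n_trials, final_length


def average_measures(runs):
    """Average both measures over all runs for one set of parameter values."""
    measures = [run_measures(r) for r in runs]
    mean_trials = sum(m[0] for m in measures) / len(measures)
    mean_final = sum(m[1] for m in measures) / len(measures)
    return mean_trials, mean_final
```

A lower average number of trials means fewer failures within the fixed step budget, while a higher final trial length indicates a better solution at the end of the run.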


Performance  did  not  vary  considerably  for  the  parameter  values  that  were  tested. 
The  best  parameter  values  were  used  to  obtain  a  more  statistically-significant  result 
by  performing  10  runs  of  500,000  steps  each.  The  following  parameter  values  were 
used: 

$\beta = 0.05$,

$\rho = 0.5$,

$\gamma = 0.9$,

resulting in the number of failures for the 10 runs shown in Table 14. The average
number of trials for each run is approximately 30,593. In addition, the number of
steps in the last trial is shown for each run. As explained above, this value is actually
the larger of the last trial length and the previous trial length, in case the last trial
had just begun when the run was terminated at step 500,000.

Table 14: Results of One-Layer System

Run    Trials    Last Trial
  1    33,977        14
  2    61,888         4
  3    24,795        16
  4    22,717       130
  5    28,324        28
  6    15,218       100
  7    31,594        15
  8    44,903         9
  9    16,115        72
 10    26,402        14
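
As a quick arithmetic check, the average number of trials can be recomputed from the per-run trial counts listed in Table 14. The variable names are illustrative only.

```python
# Trial counts for runs 1 through 10, copied from Table 14.
trials = [33977, 61888, 24795, 22717, 28324, 15218, 31594, 44903, 16115, 26402]

mean_trials = sum(trials) / len(trials)
print(round(mean_trials))  # prints 30593
```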

The number of steps per trial versus the number of trials is plotted in Fig. 18.
The plotted values are averages over the 10 runs and over bins of 100 trials, i.e.,
the trials for a run are grouped into intervals of 100 trials, the number of steps per
trial is averaged for each interval, and the results are averaged over the runs. The
learning curve shows that performance improves with experience: the trial length
is approximately equal to 10 steps initially, and after 30,000 trials approximately 30
steps occur per trial. Even with little experience, the learning system performs better
than a fixed controller that selects pushes on the cart at random. The large variance
from trial to trial is due to the initialization of the cart-pole system to random states
upon failure. The starting state of each trial might be very similar to a failure state,
or it might be near the state of perfect balance where $(x, \dot{x}, \theta, \dot{\theta}) = (0, 0, 0, 0)$. This
method of restarting after failure differs from that used in our earlier work described
in Ref. [13], where we started the cart-pole system at state $(0, 0, 0, 0)$ after every
failure.

Figure 18: Balancing Time versus Trials for One-Layer System
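
The averaging used for the learning curve can be sketched as follows. The function name is hypothetical, and truncating the curve to the shortest run is an assumption of this sketch; the text does not say how runs with different numbers of trials were combined for the one-layer curve.

```python
def binned_curve(runs, bin_size=100):
    """Average steps per trial over bins of trials, then over runs.

    runs is a list of runs, each a list of trial lengths (steps per trial).
    The result is cut off at the number of bins in the shortest run
    (an assumption made for this sketch).
    """
    per_run = []
    for lengths in runs:
        bins = [lengths[i:i + bin_size] for i in range(0, len(lengths), bin_size)]
        per_run.append([sum(b) / len(b) for b in bins])
    n_bins = min(len(curve) for curve in per_run)
    return [sum(curve[k] for curve in per_run) / len(per_run) for k in range(n_bins)]
```

Averaging within bins first smooths out the trial-to-trial variance caused by the random starting states.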

The values of the weights at the end of each run varied considerably. The best of
the 10 runs, with 15,218 failures, produced the weights that are displayed on
the network schematic shown in Fig. 19. Positive weights are drawn as hollow circles
and negative weights as filled disks. The magnitude of a weight is proportional to
the radius of its circle, or disk. From the size of the weights we see that the output
of the evaluation network (Fig. 19a) is rather insensitive to the values of the state
variables, and the value of the output is always negative. The output of the action
network (Fig. 19b) does depend on the system's state. A large $\theta$ has a positive effect,
producing a push on the cart to the right, and a large value for $x$ has a negative effect,
pushing the cart to the left.

Figure 19: Weights Learned by One-Layer Network

A better understanding of what these weights mean is obtained from a graph
of the output of the networks versus the state. To display these functions of four
variables, we generated graphs of the functions' values versus $\theta$ and $\dot{\theta}$ for nine different
pairs of $x$ and $\dot{x}$ values. Fig. 20a and Fig. 20b contain such graphs for the evaluation
network and the action network, respectively. The insensitivity of the evaluation
network to the state is evident from the flat surfaces of its graphs. The base plane
in these graphs does not represent a value of 0; the surface is actually at a small
negative value. Obviously this function serves no useful role as an evaluation function
for states. It is for this reason that the one-layer system could not improve its
performance over 30 steps per trial. Credit is assigned by the external reinforcement
signal only to actions that push the cart-pole into a failure state in one step. These
actions may not be responsible for the failure and may even be correct.

Fig. 20b shows that the action network has learned a function with approximately
the desired shape (Fig. 17b). The height above the base plane represents the
probability of generating a push to the right. The level of the base plane is at zero
probability, so for states where the surface lies near the base plane, a push to the left
is generated with high probability. The middle graph, where $x = 0$ and $\dot{x} = 0$, shows
a smooth transition from a high probability of pushing left to a high probability of
pushing right as $\theta$ and $\dot{\theta}$ go from negative to positive. This transition is shifted in
the direction of negative $\theta$ for negative values of $x$ and in the positive $\theta$ direction
for positive values of $x$. In relating these graphs to those for the two-layer networks,
we will see that this relationship between the transition line and the value of $x$ is
the opposite of what is needed to balance the pole while avoiding collisions with the
track boundaries.

Results  of  Two-Layer  Experiments 

Two-layer networks were formed by adding 5 hidden units to each of the evaluation
and action networks. The learning methods for the two-layer networks depend
on seven parameters: $\beta$, $\beta_h$, $\beta_m$, $\rho$, $\rho_h$, $\rho_m$, and $\gamma$. As was done for the one-layer
system, sets of parameter values (approximately 10) were each tested in 5 runs of
500,000 steps. Performance varied significantly for small changes in parameter values
($\gamma$ was not varied). The values giving the best performance are:

$\beta = 0.2$,

$\beta_h = 0.2$,

$\beta_m = 0$,

$\rho = 0.5$,

$\rho_h = 0.5$,

$\rho_m = 0$,

$\gamma = 0.9$.

Notice that $\beta_m = \rho_m = 0$. Results suggest that nonzero momentum in the learning
rules for the hidden units hinders performance on this task. These values were used
for 10 runs of 500,000 steps, resulting in the total number of trials and final trial
lengths shown in Table 15. The average number of trials over all runs is approximately
10,983, compared to 30,593 trials for the one-layer system. Even after much
learning experience, a nonzero probability of selecting the wrong action exists for
every state, as suggested by the relatively small number of steps in the last trials of
Runs 1 and 10.

Table 15: Results of Two-Layer System

Run    Trials    Last Trial
  1    10,123          88
  2     7,790       2,011
  3     5,814      14,535
  4     8,466       5,753
  5     7,212      28,407
  6    23,539      20,328
  7    19,401      14,302
  8     8,804       4,674
  9     9,756      20,889
 10     9,645

The learning curve for the two-layer system is shown in Fig. 21. The large, stair-like
jumps in the curve are due to the way in which performance is averaged over runs,
described as follows. The axis for the number of trials is labeled from 0 to 30,000
trials, so runs for which fewer than 30,000 trials occurred were handled in a special
way. The learning curve for such a run is extended to 30,000 trials by assigning
to trials that didn't occur a value equal to the larger of (a) the number of steps in
the last trial, and (b) the number of steps in the previous trial. In this way, a very
short final trial is disregarded and the length of the previous trial is used to extend
the curve. The large jumps in the curve occur when the last trial of a particular
run is very long and the run is terminated when 500,000 steps have elapsed. If the
experiments had been run for more steps, the final performance level would have
been higher. This large number of steps is then averaged into the performance curve
from that trial through trial 30,000. For example, the jump at trial 20,000 is due to
the last trial of run 7, which was approximately 14,000 steps in length. All 10 runs
were terminated before 30,000 trials elapsed, resulting in a final performance level
of about 11,000 steps per trial. Recall that the final level achieved by the one-layer
system was only about 30 steps per trial.

Figure 21: Balancing Time versus Trials for Two-Layer System
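
The rule for extending a short run's learning curve can be sketched as below. The function name is hypothetical; the fill value follows the description above (the larger of the last and previous trial lengths).

```python
def extend_run(trial_lengths, n_trials=30000):
    """Pad a run's per-trial step counts out to n_trials entries.

    Trials that never occurred are assigned the larger of the last and the
    previous trial length; runs that already have n_trials or more entries
    are simply cut at n_trials.
    """
    fill = max(trial_lengths[-2:])
    missing = max(0, n_trials - len(trial_lengths))
    return list(trial_lengths[:n_trials]) + [fill] * missing
```

For a run like run 7, which ended after about 19,400 trials with a last trial of roughly 14,000 steps, every padded entry carries that large value, which is what produces a stair-like jump in the averaged curve.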

The  large  number  of  weights,  35  in  each  network,  makes  it  difficult  to  interpret 
the  solutions  found  directly  from  the  weight  values.  The  relative  magnitudes  and 
signs  of  the  weights  are  shown  in  the  network  schematics  of  Fig.  22.  Figure  22a 
shows  the  final  weight  values  for  the  evaluation  network  of  run  6,  and  Fig.  22b  shows 
the  weights  for  the  action  network.  Units  1,  2,  4  and  5  of  the  evaluation  network  are 
similar,  having  all  positive  weights.  (In  the  figure,  the  small  size  of  the  corresponding 
circles makes them appear to be filled-in disks.) Unit 3's weights differ, and it is also
distinguished  by  having  a  large  positive  connection  to  the  output  unit.  It  appears 
that  only  unit  3  has  developed  a  new  feature  that  is  useful  for  the  prediction  of 
failure. 

The function implemented by the evaluation network appears in Fig. 23a. The
height of the surface ranges between approximately 1.5 and 0.1. Its shape is just what
is needed for the action network to receive an immediate evaluation of an action. At
the center of each base plane, representing the $(\theta, \dot{\theta})$ subspace, the cart-pole is in a
state where the pole is vertical and not falling. The evaluation has its highest value
for these states, therefore forming an evaluation function that decreases as the
cart-pole system moves away from this state. Any action that takes the system toward
either the positive or negative $(\theta, \dot{\theta})$ corner results in a negative evaluation change,
i.e., a negative $\hat{r}$. The tilt of the surface as $x$ and $\dot{x}$ change is also correct. Positive
$(\theta, \dot{\theta})$ states are more likely to result in failure when the cart is heading toward the
right border of the track, where $x$ and $\dot{x}$ are positive. Similarly, negative $(\theta, \dot{\theta})$ states
are likely to precede failure when the cart is heading to the left border, where $x$ and
$\dot{x}$ are negative.

Before discussing the features learned by the hidden units, let us look at the
solution learned by the action network. From Fig. 22b we see that again hidden
unit 3 differs from the other units in its weight values: $\theta$ and $\dot{\theta}$ have large positive
effects on unit 3's output, and $x$ and $\dot{x}$ have smaller positive and negative effects,
respectively. Unit 3 is connected positively to the output unit, whereas the other
units are connected negatively. The fact that unit 3 of each network plays a
significant role is fortuitous; for other runs useful features are learned by different
sets of units.

These hidden-unit influences in combination with the direct effects of the network's
input on the output unit result in the action function displayed in Fig. 23b.
Two observations can be made about the contrast between this action function and
that learned by the one-layer network (Fig. 20b). First, the transition from a high
probability of pushing left to a high probability of pushing right is much quicker, as $\theta$
and $\dot{\theta}$ vary. This probability function implements a much more deterministic control
than does that of the one-layer network. Due to the good evaluation function learned
by the evaluation network, actions near the transition line are credited or blamed
appropriately. A second observation is that the shift in the transition line as $x$ and
$\dot{x}$ vary is in the right direction. The pole should be balanced slightly to the right of
vertical (positive $\theta$) when the cart is near the left track boundary (negative $x$), and
to the left of vertical when near the right boundary, resulting in a net action over
several steps of a push towards the center of the track. To see that this is indeed
what happens, note that the point at which the pole is balanced is roughly indicated
by the location of the transition line. This line shifts toward positive $\theta$ when $x$ is
negative, and toward negative $\theta$ when $x$ is positive.

Now we continue with the analysis of the hidden units. Unit 3 of both networks
acquired significant effects on the respective output units. The functions implemented
by their weights can be visualized by graphing them as functions of $\theta$ and $\dot{\theta}$ for
different values of $x$ and $\dot{x}$, as done for the functions implemented by an entire network.
Figure 24a shows these graphs for unit 3 of the evaluation network and Fig. 24b
shows the graphs for unit 3 of the action network. The outputs of these units vary
between 0 and 1. Very similar functions were learned by the two units. They both
produce a fairly constant value of 1 for most states, with lower values approaching
0 when $\theta$ and $\dot{\theta}$ become more negative. However, the contribution of unit 3 of the
action network is very small: its output weight is small in comparison to the larger
weights on the output unit, unit 6. This is not surprising, since the desired mapping
from state to action can be implemented with a single unit. In fact, setting unit 3's
output weight to 0 and adding its magnitude to unit 6's constant-input weight causes
little change in the state-to-action mapping.

To  test  the  significance  of  the  new  feature  learned  by  unit  3  of  the  action  network, 
further  experiments  were  run  with  a  one-layer  action  network  and  the  two-layer 
evaluation  network.  The  one-layer  action  network  did  learn  the  desired  function, 
but  it  learned  it  more  slowly  than  did  the  two-layer  action  network.  Perhaps  the 
feature  learned  by  unit  3  facilitated  the  learning  of  a  good  action  function,  and 
with  additional  experience  the  output  unit  developed  the  appropriate  weights  for  its 
state-variable  inputs.  This  must  be  verified  by  observing  the  evolution  of  the  action 
function  both  as  a  function  of  the  state  variables  and  as  a  function  of  the  hidden 
units’  outputs. 

The role of unit 3 in the evaluation network is much more important. The hill-shaped
evaluation surface cannot be implemented without the hidden units, as shown
by the results of the single-layer experiments. Through its positively weighted
connection to the output unit, unit 3 generates the positive gradient in the evaluation
surface as one moves from negative θ and θ̇ to θ = θ̇ = 0. At this point, the gradient
in the response of unit 3 effectively becomes zero, and the output unit's negative
weights from its state-variable inputs provide the negative slope as one moves in the
positive θ, θ̇ direction.

Conclusion


It is immediately apparent from the learning curves of Figs. 18 and 21 that the
two-layer learning system far outperformed the one-layer learning system. New
features are required for representing a good evaluation function and, although they are
not required for representing a good action function, new features facilitate learning
the action function. Thus, the synthesis of the error back-propagation scheme of
Rumelhart, Hinton, and Williams [44] and reinforcement-learning techniques
produced an adaptive network that successfully deals with delayed reinforcement and
the initial lack of an adequate representation. The learning method resulted in a
controller that balanced the pole for 9 minutes (simulated time: 28,000 steps at 0.02
seconds per step) and probably would have balanced it longer if the experiments had
been run for a greater number of steps.

Comparison with the single-layer system of Barto et al. [13] is made difficult
by the differences in how the experiments were conducted. The difference with the
greatest effect is that in the experiments described in Ref. [13], the cart-pole system
was always restarted in the same state, (x, ẋ, θ, θ̇) = (0, 0, 0, 0), following failure,
whereas here the start state was selected randomly. Average performance is kept
low by start states that are very close to failure states. Disregarding this difference,
comparisons show that the previous system achieved a higher average trial length
after 500,000 steps, 80,000 steps per trial, while the experiments here resulted in
approximately 30,000 steps per trial. This difference reflects the fact that the
two-layer system learned good solutions later in the runs than did the system of Ref. [13].
We conclude that a considerable number of steps are required for the hidden units
to learn the necessary features. It is not until good features are learned that a useful
evaluation function can be formed, and until the evaluation function is learned, the
action network cannot improve beyond a low level of performance.


Learning  to  Solve  a  Puzzle 


This section describes experiments that apply the strategy learning network
described in Section 5 to the Tower of Hanoi puzzle. Because the state-space
concept underlying the learning system for the pole-balancing problem also applies
to problem-solving tasks that have been studied by Artificial Intelligence (AI)
researchers, it is possible to use the same kind of learning methods for both types of
problems. Applying this kind of learning method to a problem like the Tower of Hanoi
puzzle is very different from applying one of the more knowledge-intensive learning
methods of AI, and the results may not really be comparable. The knowledge-intensive
approach is probably closer to how a human might learn to solve a puzzle
like this, whereas the network method seems more similar to the acquisition of a
motor skill. The primary purpose in applying the network method to this puzzle
was to provide a good test of the multilayer strategy learning network. Nevertheless,
we still make some comparisons of methodology with Langley's [31] adaptive
production system, called SAGE, which learns heuristics that improve the performance of
an initial weak search strategy. To facilitate this comparison, we selected the
three-disk Tower of Hanoi puzzle for our experiments, one of the puzzles that Langley used
to demonstrate SAGE.


The  Tower  of  Hanoi  Puzzle 


The Tower of Hanoi puzzle is popular for research in problem-solving because the
number of states is small, but the puzzle is still difficult to solve. Human strategies
for solving the Tower of Hanoi have been analyzed [32] and modeled [5]. Amarel [2]
used the Tower of Hanoi puzzle as a vehicle for studying shifts of representations to
forms of increasing efficiency for the discovery of a problem's solution.

The state of the Tower of Hanoi puzzle can be represented in a number of ways. A
common representation is one used by Nilsson [38], in which the pegs are numbered
1, 2, and 3, and the disks are labeled A, B, and C, where disk A is the smallest disk
and C is the largest. A particular state is represented by the peg numbers where
each disk resides, listed for disk C, then disk B and disk A. As pictured in Fig. 25,
the initial state of the puzzle is (111), and the objective is to achieve state (333) by
applying a sequence of actions. The legal actions are movements of the top-most
disk from one peg to another, with the restriction that a disk may never be placed
upon a smaller disk. An action may be represented as a source peg and destination
peg, so the transformation of state (111) to state (112) is performed by the action of
moving the top-most (smallest) disk from peg 1 to peg 2, represented by action 1-2.
For the three-peg puzzle, six actions are needed: 1-2, 1-3, 2-1, 2-3, 3-1, and 3-2.

Figure 25: Initial and Goal States of the Tower of Hanoi Puzzle
(initial state (111); goal state (333))

The  states  of  the  puzzle  plus  the  transitions  between  the  states  corresponding 
to the legal actions form the puzzle's state transition graph shown in Fig. 26. To
evaluate a learning system's ability to improve search strategies on the Tower of Hanoi puzzle, we measure
the  number  of  actions  in  the  solution  path,  with  the  minimum  length  path  being  the 
objective.  For  the  three-disk  puzzle,  the  minimum-length  solution  path  has  seven 
actions  and  is  the  straight  path  down  the  right  side  of  the  state  transition  graph. 
Finding  the  shortest  solution  path  is  confounded  by  the  large  number  of  possible 
solution  paths  and  by  the  presence  of  loops  in  the  state  transition  graph. 
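The structure of the state transition graph and the seven-action minimum can be checked with a short breadth-first search. The following Python sketch is illustrative only: the function names (legal_moves, shortest_path_length) are hypothetical, and the tuple layout, which lists the pegs of disks C, B, and A in that order as in the text, is an assumption made for this example rather than a description of the original simulation code.

```python
from collections import deque

# (source peg, destination peg) for actions 1-2, 1-3, 2-1, 2-3, 3-1, 3-2
ACTIONS = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]


def legal_moves(state):
    """Yield (action, next_state) for a state given as (peg of C, peg of B, peg of A)."""
    for src, dst in ACTIONS:
        # Positions in the tuple: 0 is disk C (largest), 2 is disk A (smallest).
        on_src = [i for i, peg in enumerate(state) if peg == src]
        on_dst = [i for i, peg in enumerate(state) if peg == dst]
        if not on_src:
            continue  # no disk on the source peg
        top = max(on_src)  # the smallest disk on a peg is its top disk
        if on_dst and max(on_dst) > top:
            continue  # destination's top disk is smaller than the moving disk
        nxt = list(state)
        nxt[top] = dst
        yield (src, dst), tuple(nxt)


def shortest_path_length(start=(1, 1, 1), goal=(3, 3, 3)):
    """Breadth-first search for the minimum number of actions from start to goal."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        for _, nxt in legal_moves(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None  # goal not reachable
```

Calling shortest_path_length() with the default arguments searches from (111) to (333). Because breadth-first search visits states in order of distance, loops in the graph do not affect the result.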

The Tower of Hanoi puzzle is a good test of the multilayer network described in
Section 5 for the following reason. A useful evaluation function must map states to
values that indicate the states' closeness to the goal node. For the state representation
described above, this mapping cannot be formed by a linear threshold unit, or other
unit based on a weighted sum of its inputs. For the experiments described in this
section the representation was simplified somewhat, as described later, to reduce the
time required to learn the solution. The fact that new features are still required is
shown by the inability of a one-layer system to learn the minimal solution path.

Figure 26: State Transition Graph for Three-Disk Tower of Hanoi Puzzle
(Adapted from Nilsson [38])

The  formation  of  useful  search  heuristics  for  the  Tower  of  Hanoi  puzzle  is  less 
complicated.  A  small  set  of  rather  simple  heuristics  can  constrain  the  search  to 
exactly  the  correct  actions  [5,31].  For  example,  many  alternatives  are  removed  by 
a  rule  stating  that  it  is  undesirable  to  apply  the  inverse  of  an  action.  The  action 
network  used  to  learn  search  heuristics  in  our  experiments  is  single-layered,  and  did 
successfully  learn  the  minimal  solution  path.  However,  it  did  not  do  this  by  learning 
the simple set of heuristics. We discuss this point later when we compare the network
method to Langley's production system [31].

Representation  of  States  and  Actions 

As mentioned above, the state representation consisting of three peg numbers
corresponding to each disk results in a very complex mapping from states to
evaluations. Although in principle the networks used here should be able to learn this
mapping, we wished to simplify the task somewhat to reduce the simulation time
required for the experiments. The state representation used in the following
experiments is composed of nine binary digits. The first three digits encode Disk C's peg
number, the second three encode Disk B's peg, and the third three encode
Disk A's peg. Peg 1 is encoded as 100, Peg 2 as 010, and Peg 3 as 001. For example,
state (111) is represented as (100 100 100), and state (123) is represented as (100
010 001). We also use a constant input of value 0.5 so the threshold can be varied.


Specifically, the input terms, x_i[t], for the state at time t are given by:

\[ x_0[t] = 0.5, \]

\[ x_1[t] = \begin{cases} 1, & \text{if Disk C is on Peg 1 at time } t; \\ 0, & \text{otherwise,} \end{cases} \]

\[ x_2[t] = \begin{cases} 1, & \text{if Disk C is on Peg 2 at time } t; \\ 0, & \text{otherwise,} \end{cases} \]

\[ x_3[t] = \begin{cases} 1, & \text{if Disk C is on Peg 3 at time } t; \\ 0, & \text{otherwise,} \end{cases} \]

and similarly for x_4[t], x_5[t], x_6[t] and Disk B, and for x_7[t], x_8[t], x_9[t] and Disk A.
After the goal state is reached and at the start of every run, the state is set to (111),
so the input becomes (100 100 100) disregarding the constant input.
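The encoding just defined can be written compactly. The sketch below is a minimal illustration in Python; the names PEG_CODE and encode_state are hypothetical, and the state is assumed to be a tuple giving the pegs of disks C, B, and A in that order.

```python
# One-of-three code for each peg number, as given in the text.
PEG_CODE = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}


def encode_state(state, constant_input=0.5):
    """Return [x0, x1, ..., x9] for a state given as (peg of C, peg of B, peg of A)."""
    x = [constant_input]          # x0: constant input that lets the threshold vary
    for peg in state:             # disk C first, then disk B, then disk A
        x.extend(PEG_CODE[peg])
    return x
```

For example, encode_state((1, 2, 3)) produces the constant input followed by the digits 100 010 001.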

Both  the  evaluation  network  and  the  action  network  receive  the  representation  of 
the  state.  This  completely  defines  the  input  to  the  evaluation  network,  but  additional 
terms  are  presented  to  the  action  network.  We  wished  to  investigate  the  ability  of 
the  network  to  learn  search  heuristics  similar  to  the  rules  developed  by  Langley’s 
SAGE  system.  As  mentioned  above,  one  such  rule  is  to  never  apply  the  inverse  of 
the  previous  action.  In  order  to  learn  such  an  association  between  the  previous  action 
and  the  current  action,  the  previous  action  must  be  provided  as  input  to  the  action 
network.  Another  rule  learned  by  SAGE  is  to  not  apply  an  action  that  returns  the 
puzzle  to  a  state  that  was  visited  two  steps  ago.  This  avoids  the  three-step  loops 
around the smallest triangles in the state transition graph (Fig. 26). Rather than
providing past states as input, we chose to present the action taken two steps ago, in
addition to the previous action. The previous two actions along with the current state
provide enough information to identify the state visited two steps ago, although our
results suggest that hidden units are needed to overcome the linearity of the output
unit, perhaps by forming a conjunction of the previous two actions. This possibility
was not investigated.

The output of the action network represents an action by a six-component,
standard unit basis vector, where the components correspond to actions 1-2, 1-3, 2-1, 2-3,
3-1, and 3-2, respectively; Action 1-2 is encoded as (100000), Action 1-3 is (010000),
and so on.

Letting the action at time t be denoted by (a_1[t], a_2[t], ..., a_6[t]), the input terms
that the action network receives in addition to those also received by the evaluation
network are:

\[ x_{10}[t] = a_1[t-1], \quad \ldots, \quad x_{15}[t] = a_6[t-1], \]

and

\[ x_{16}[t] = a_1[t-2], \quad \ldots, \quad x_{21}[t] = a_6[t-2]. \]
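As an illustration of how the action network's input could be assembled, the following Python sketch concatenates the ten state terms with the two delayed action vectors. The function names are hypothetical. The ordering (previous action in x10 to x15, action from two steps ago in x16 to x21) is an assumption read from the partly illegible equations, and representing a missing earlier action by an all-zero vector follows the later remark that the delayed-output terms are 0 on the first step of a trial.

```python
ACTIONS = ["1-2", "1-3", "2-1", "2-3", "3-1", "3-2"]


def one_hot_action(action):
    """Six-component unit basis vector; all zeros if no action has been taken yet."""
    vec = [0] * len(ACTIONS)
    if action is not None:
        vec[ACTIONS.index(action)] = 1
    return vec


def action_network_input(state_terms, prev_action, prev_prev_action):
    """Concatenate x0..x9 with a[t-1] (x10..x15) and a[t-2] (x16..x21)."""
    return (list(state_terms)
            + one_hot_action(prev_action)
            + one_hot_action(prev_prev_action))
```

With the ten state terms for (111) and no earlier actions, the result has 22 components, of which only the constant input and three state digits are nonzero.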

Reinforcement 

The most significant reinforcement occurs whenever the goal state is entered. A
reinforcement value, labeled r[t], of 1 is presented for the time step at which the
goal state (333) is entered. Recall that for the pole-balancing task, the goal is to
avoid certain states for as long as possible, and r[t] was set to -1 upon entering
those states. The Tower of Hanoi task could be solved (by a two-layer network) with
only this final reinforcement, but two additional reinforcements are provided for the
following reasons. If the action probabilities converge too quickly, due to a large value
for the parameter ρ, a solution path of longer than minimum length will probably
be learned. For example, say the learned solution path is of length eight, one step
longer than the minimum number, due to the incorrect Action 1-2 being taken from
the starting state (111). If this action is always chosen over the correct Action 1-3,
then an evaluation function tailored to this particular solution path will be learned.
To avoid this, a second reinforcement signal is presented having a constant value
of -0.1 for all non-goal states. In this way, a shorter solution path results in a
higher total reinforcement than does a longer solution path. This parallels the role
of Langley's heuristic for judging shorter paths between two states as more desirable.

The third reinforcement is a value of -1 presented whenever a two-step loop
occurs, i.e., when the current action reverses the effect of the previous action. The
random search initially followed by the action network results in many two-step
loops (and longer loops), so many steps elapse before the goal is discovered. The
large negative reinforcement results in a significant decrease in the probability of
selecting the action that is the inverse of the previous action. As shown later, this
must be learned for each action: the concept of inverse actions is not known to
the system, so generalizing across actions is not possible. This reinforcement is
not presented to the evaluation network, enabling a large negative reinforcement to
be used without decreasing the evaluation for the corresponding state. The large
negative reinforcement is not meant to indicate that a state is bad, only that the
action was bad. The selection of the inverse action should be discouraged, but not
necessarily the visitation of the state.

This third type of reinforcement is not necessary for the system to learn the
puzzle's solution. It was included for two reasons. First, it does significantly reduce the
number of search steps during the early stages of learning. Second, it demonstrates
how domain knowledge, such as the undesirability of two-step loops, can be added
by altering the reinforcement function.

The value of r[t] includes just the first two kinds of reinforcement, while the
two-step loop penalty is given by r_loop[t] to distinguish between the reinforcements that
are and are not presented to the evaluation network:

\[ r[t] = \begin{cases} 1.0, & \text{if state at time } t \text{ is (333);} \\ -0.1, & \text{otherwise,} \end{cases} \]

\[ r_{loop}[t] = \begin{cases} -1.0, & \text{if state at } t \text{ equals state at } t-2; \\ 0.0, & \text{otherwise.} \end{cases} \]
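The two reinforcement signals can be sketched as simple functions of the state history. The Python below is illustrative only: the function names are hypothetical, the loop penalty of -1.0 and the comparison with the state two steps earlier are read from a partly illegible equation and should be treated as assumptions, and total_reinforcement assumes that every action on a solution path receives -0.1 except the last, which receives 1.0.

```python
GOAL = (3, 3, 3)


def reinforcement(state):
    """r[t]: 1.0 on entering the goal state, -0.1 for every other state."""
    return 1.0 if state == GOAL else -0.1


def loop_reinforcement(state, state_two_steps_ago):
    """r_loop[t]: assumed penalty of -1.0 when the last action undid the one before it."""
    return -1.0 if state == state_two_steps_ago else 0.0


def total_reinforcement(path_length):
    """Sum of r[t] over a loop-free solution path with the given number of actions."""
    return 1.0 - 0.1 * (path_length - 1)
```

Under these assumptions a seven-action path accumulates about 0.4 and an eight-action path about 0.3, which is the sense in which shorter paths earn a higher total reinforcement.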


The learning methods are very similar to those used for the pole-balancing task,
with small modifications to the network structure and the learning rules. The only
modification to the learning rules used for the pole-balancing task involves the
equation for updating the weights of the action network. The reinforcement signal for
two-step loops, r_loop, is added as follows:

\[ w_{ij}[t] = w_{ij}[t-1] + \rho \, (r_{loop}[t] + r[t]) \, (a_j[t-1] - E\{a_j[t-1] \mid w; x\}) \, x_i[t-1]. \]
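To make the structure of this update concrete, the following Python sketch applies one step of a rule of this form to a single-layer action network. It is a sketch under stated assumptions, not the original implementation: the function names are hypothetical, the weights are assumed to be stored as one list per action unit, and the expected output of each unit is assumed to be a logistic function of its weighted sum, which this section does not specify.

```python
import math


def expected_outputs(weights, x):
    """Assumed stand-in for E{a_j | w; x}: logistic function of each unit's weighted sum."""
    return [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(w_j, x))))
            for w_j in weights]


def update_action_weights(weights, x_prev, a_prev, r, r_loop, rho):
    """Add rho * (r_loop + r) * (a_j - E{a_j}) * x_i to each weight, in place.

    weights[j][i] connects input i to action unit j; x_prev and a_prev are the
    input vector and the action vector from the previous time step.
    """
    expected = expected_outputs(weights, x_prev)
    signal = r_loop + r  # the loop penalty is added to the reinforcement term
    for j, w_j in enumerate(weights):
        for i, x_i in enumerate(x_prev):
            w_j[i] += rho * signal * (a_prev[j] - expected[j]) * x_i
    return weights
```

With a negative combined signal, the weights from active inputs to the unit whose action was just taken decrease, which lowers the probability of repeating that action in the same situation.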


Results  of  One-Layer  Experiments 


As  was  done  for  the  pole-balancing  experiments,  the  performance  of  a  system 
with  a  one-layer  evaluation  network  was  compared  to  the  performance  of  a  system 
with a two-layer evaluation network. The systems with the one- and two-layer evaluation
networks are shown in Figs. 28 and 31, respectively. As in the pole-balancing
experiments,  two  performance  measures  were  used  to  select  the  best  values  for  the 
parameters  of  the  weight-update  equations.  A  measure  of  cumulative  performance 
throughout  a  run  is  provided  by  the  number  of  trials  (achievements  of  the  goal  state) 
averaged  over  all  runs  for  a  given  set  of  parameter  values.  The  second  performance 
measure  is  the  average  over  all  runs  of  the  number  of  steps  in  the  last  trial,  or  the 
preceding  trial,  whichever  is  smaller. 
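The two performance measures can be computed from a record of trial lengths. The Python sketch below is illustrative; the function name and the input layout (one list of trial lengths per run, in order of occurrence, each with at least one trial) are assumptions made for this example.

```python
def performance_measures(trial_lengths_per_run):
    """Return (mean number of trials, mean final trial length) over all runs.

    trial_lengths_per_run holds one list per run; each list gives the number
    of steps in every completed trial of that run, in order.
    """
    n_runs = len(trial_lengths_per_run)
    # Cumulative measure: number of times the goal was reached, averaged over runs.
    mean_trials = sum(len(run) for run in trial_lengths_per_run) / n_runs
    # Final measure: the shorter of the last two trials of each run, averaged over runs.
    mean_final = sum(min(run[-2:]) for run in trial_lengths_per_run) / n_runs
    return mean_trials, mean_final
```

Taking the shorter of the last two trials reduces the effect of a single exploratory trial at the end of a run on the final measure.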

The final performance level averaged over 5 runs of 50,000 steps each was used
to select the best of approximately 20 sets of parameter values, differing in ρ and β,
leaving γ = 0.9. The best of these values are:

β = 0.100,

ρ = 0.01,

γ = 0.9.

These values were used in a longer experiment of 10 runs of 100,000 steps each,
resulting in the number of trials and last trial lengths shown in Table 16.

Table 16: Results of One-Layer System

Run    Trials    Last Trial
  1     2,911        28
  2     2,877        23
  3     2,884        19
  4     2,893        05
  5     2,680       651
  6     2,928        25
  7     4,481         9
  8     2,902        25
  9     3,951        38
 10     2,940        41

The average number of trials is 3,145. From the lengths of the last trial for each run
we see that the minimal solution path was not learned in any run; all trials are
longer than seven steps. Run 7 resulted in a last trial of length nine, but it was not
determined whether the path taken on the final trial would be reliably followed for
subsequent trials. The number of trials for Run 7 (4,481 in 100,000 steps, or about
22 steps per trial on average) is low enough to indicate that paths longer than nine
steps are likely.

The trial length versus the number of trials is plotted in Fig. 27, showing how the
length of the solution path varies with experience. The horizontal dotted line in the
figure is at a trial length of seven, the length of the minimal solution path. The values
plotted are averages over the 10 runs and over bins of 100 trials. The length of the
solution path decreases quickly from an average of 50 steps to approximately 30 steps,
but performance is never much better than 30 steps per solution. A non-learning
strategy of random action selection was found to result in an average of 140 steps
per solution path, so the one-layer system significantly improves the initial random
search strategy. Note that all runs were terminated before 5,000 trials elapsed; the
learning curve was extended as was done for the pole-balancing experiments. The
curve might have continued to decrease slightly if the one-layer experiments had been
run longer.
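A non-learning baseline of this kind can be estimated by simulation. The Python sketch below is a hypothetical illustration, not the procedure used in the original experiments: it assumes that the random strategy chooses uniformly among the legal actions in each state, which this section does not state, so its estimate need not match the reported figure. All function names are invented for the example.

```python
import random

# (source peg, destination peg) for actions 1-2, 1-3, 2-1, 2-3, 3-1, 3-2
ACTIONS = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]


def legal_next_states(state):
    """States reachable in one legal move from (peg of C, peg of B, peg of A)."""
    result = []
    for src, dst in ACTIONS:
        on_src = [i for i, peg in enumerate(state) if peg == src]
        on_dst = [i for i, peg in enumerate(state) if peg == dst]
        # Legal if the source peg is non-empty and the destination's top disk
        # (largest index present) is not smaller than the disk being moved.
        if on_src and not (on_dst and max(on_dst) > max(on_src)):
            nxt = list(state)
            nxt[max(on_src)] = dst
            result.append(tuple(nxt))
    return result


def random_trial_length(rng, start=(1, 1, 1), goal=(3, 3, 3)):
    """Number of uniformly random legal moves taken to reach the goal."""
    state, steps = start, 0
    while state != goal:
        state = rng.choice(legal_next_states(state))
        steps += 1
    return steps


def mean_random_trial_length(n_trials=1000, seed=0):
    """Average trial length of the random strategy over n_trials trials."""
    rng = random.Random(seed)
    return sum(random_trial_length(rng) for _ in range(n_trials)) / n_trials
```

If the original baseline instead sampled among all six actions and counted illegal attempts as steps, its average would be larger than the value this sketch produces.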

The weights learned by the end of Run 7 are shown in Fig. 28. The evaluation
network has acquired only three weights of significant magnitude, and they are all
associated with the encoding of Disk C's peg. The weights' signs result in high
evaluations for states with Disk C on Peg 3, the goal peg, and low evaluations for other
states. The weights of the action network are more difficult to interpret. Let us
postpone the discussion of these weights until the results of the subsequent experiments
with a two-layer evaluation network are presented.

We can visualize the learned evaluation function by drawing at each node in the
state graph a circle with radius proportional to the state's evaluation. The evaluation
function learned in Run 7 is shown in this manner in Fig. 29. As determined from the
signs of the weights, the evaluation function indeed produces high values for states
for which Disk C is on Peg 3, which are the states in the large, lower right triangle
of the state transition graph. There is very little additional information provided
by this evaluation function. We can describe this function as a credit-assignment
heuristic, viz., states with Disk C on Peg 3 are desirable.

Figure 29: Evaluation Function Learned by One-Layer Network

Results of Two-Layer Experiments

Our two-layer experiments involved a two-layer evaluation network with 10 hidden
units, but with the same one-layer action network as above. We suspected that with
the delayed actions as input terms, the one-layer action network could find weight
values that result in the minimal solution path. This is verified by the results of the
experiments.

Approximately 20 sets of parameter values were tested by performing 5 runs of
50,000 steps each. The values resulting in the best performance are:

β = 0.1,

β_h = 2.0,

β_m = 0.9,

ρ = 0.02,

γ = 0.9.

Momentum was discovered to facilitate learning in this case, though interestingly
it retarded learning for the pole-balancing task. Applying these values in 10 runs
of 100,000 steps produced the results in Table 17.

Table 17: Results of Two-Layer System

Run    Trials    Last Trial
  1    11,809         7
  2    11,418        22
  3    11,584         7
  4    11,559         7
  5    11,967         7
  6    12,093         7
  7    11,856         7
  8    11,636         7
  9    12,432         7
 10    12,041         7

For all but one run the length
of the last trial was seven steps, equal to the length of the minimal solution path.
The average number of trials is 11,839, fewer than 10 steps per trial averaged over the
100,000 steps (100,000 / 11,839 is about 8.4). So in 9 out of 10 runs the minimal
solution path was learned, and
judging from the number of trials in Run 2, the minimal solution path was probably
reliably followed in that run, also. There is always a nonzero probability of trying
an alternative path, which could explain the last trial of Run 2.

The learning curve of Fig. 30 shows that the two-layer system quickly learned
solution paths averaging about 15 steps in length, and gradually reduced this to the
minimum of seven steps. The learning curve for the one-layer system is superimposed
on this graph to highlight the performance increase resulting from the hidden units
in the evaluation network.

Figure 30: Length of Solution Path versus Trials for Two-Layer System

The  weights  learned  during  Run  1  are  displayed  in  Fig.  31.  First  we  focus  on  the 
weights  of  the  evaluation  network.  The  hidden-unit  weights  are  more  varied  than  they 
were  for  the  pole-balancing  task.  Most  of  the  units  appear  to  have  acquired  useful 
new  features.  Units  1,  2,  5,  6,  and  9  have  weights  of  large  magnitude,  though  they 
are  by  no  means  the  only  units  of  significance.  As  is  usually  the  case,  it  is  difficult  to 
comprehend  what  role  the  units  play  by  studying  individual  weight  values.  However, 
by  encoding  their  output  values  by  the  size  of  circles  on  the  state  transition  graph, 
as  done  earlier  for  the  evaluation  function  itself,  we  can  learn  exactly  what  the  new 
features  are  and  can  gain  some  intuitions  about  their  contributions  to  the  overall 
evaluation  function.  First,  we  analyze  the  evaluation  function  learned  in  Run  1. 

The  value  of  the  evaluation  function  learned  in  Run  1  is  represented  in  Fig.  32. 
In  comparing  two  states,  the  state  with  the  larger  circle  would  be  evaluated  as  being 
more  desirable.  Notice  that  a  consistent  progression  from  small  circles  to  larger 
circles  results  as  one  moves  from  any  state  toward  the  goal  state  by  the  shortest 
route,  thus  this  evaluation  function  is  extremely  informative.  Any  search  strategy, 
in  addition  to  the  probabilistic  method  used  to  generate  actions,  would  benefit  from 
this evaluation function.

Now let us see how this evaluation function is constructed. Figure 33 shows the
output functions for the 10 hidden units, i.e., the features acquired during learning.
The radii of the circles for a feature are calculated by scaling the 27 output values for
the corresponding hidden unit to be between 0 and a maximum radius. So circles of
extreme size do not necessarily indicate that the output is 0 or 1, but only that the
output is a minimum or maximum of the values for that unit. We will not attempt
to explain every feature, but will consider the contributions of several. We refer to a
feature by the corresponding unit number, such as Feature 1 for the function learned
by Unit 1.

Figure 33: New Features Learned by Two-Layer Evaluation Network
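The scaling described for the circle radii amounts to a min-max normalization of each unit's outputs. A minimal Python sketch follows; the function name, the default maximum radius, and the handling of a unit whose outputs are all equal are assumptions made for this example.

```python
def scale_to_radii(values, max_radius=1.0):
    """Map a unit's output values linearly onto radii between 0 and max_radius.

    The smallest output gets radius 0 and the largest gets max_radius, so a
    full-size circle marks the unit's maximum, not necessarily an output of 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant unit is drawn with zero-radius circles.
        return [0.0 for _ in values]
    return [max_radius * (v - lo) / (hi - lo) for v in values]
```

Because each unit is scaled separately, circle sizes can be compared across states within one feature but not across features.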

Feature 1 has a positive effect on the evaluation. Figure 33 shows that Feature 1
roughly represents three states near the bottom of the graph just outside of the
lower right triangular region where Disk C is on the goal peg. Feature 1 boosts the
evaluation of these states, thus directing a search from states in the lower left part
of the graph toward the state through which the search must pass to get Disk C onto
the goal peg. This part of the graph is a "bottleneck", and similar bottlenecks exist
at the other two junctions of the three largest triangles. The values of the evaluation
function are critical near these bottlenecks: the choice of an incorrect action can
result in many additional moves to return to the bottleneck to try a different action.
Features 1 and 2 seem to be particularly helpful in evaluating the lower bottleneck
and the bottleneck on the right, respectively.

Feature  9  has  a  very  strong  negative  influence  on  the  evaluation  function.  The 
value  of  Feature  9  is  high  for  all  states  except  the  first  four  states  on  the  minimal 
solution  path,  plus  one  nearby  state.  The  evaluations  of  the  first  states  in  the  solution 
path  are  raised  in  relation  to  the  evaluations  of  the  other  states,  thus  Feature  9’s 
role  is  to  make  the  first  few  states  of  the  minimal  solution  path  more  desirable  than 
states  next  to  the  path. 

Feature  10  also  has  a  negative  effect.  Mainly  the  evaluations  of  states  along  and 
next  to  the  minimal  solution  path  are  lowered  by  Feature  10,  with  the  exception  of 
the  very  last  state  before  the  goal  state.  It  appears  that  this  feature  guarantees  that 
the  difference  in  state  evaluations  is  positive  as  the  last  state  is  reached. 

Other features also have important contributions. Perhaps a good way to
understand their roles is to observe changes in the evaluation function as each feature is
removed and then restored. From the small amount of analysis done here, it is clear
that a variety of new features were developed for this task. The initial representation
of the state is far from ideal for forming the evaluation function, but the combination
of the error back-propagation algorithm with Sutton's AHC algorithm successfully
learned sufficient new features for the state representation.

We now discuss the action function that was learned. When the two-layer
evaluation network was used, the single-layer action network was able to constrain the
search to the minimal solution path. Figure 31b shows that this was accomplished
mainly through the development of weight values for the previous step's action and
for the current state. The two-step delayed action did not acquire a significant effect
on the selection of the current action.

The  large  negative  weights  on  the  previous  action  components  stand  out.  Note 
that  there  is  one  large  negative  weight  for  each  component.  These  weights  have  the 
same  effect  as  did  Langley’s  heuristic  for  preventing  the  application  of  the  inverse  of 
the  previous  action.  By  tracing  the  delayed  output  of  each  unit  to  the  corresponding 
negative  weight,  we  find  that  the  negative  weights  are  on  the  intersections  of  actions 
and their inverses. In other words, Action 1-2, or a_1[t-1], has a negative connection
to Action 2-1, or a_3[t], Action 1-3 is negatively connected to 3-1, etc. A negative
weight  lowers  the  probability  that  the  corresponding  action  will  be  generated,  and 
these  weights  are  of  such  high  magnitudes  that  the  probability  of  the  previous-action’s 
inverse  is  effectively  zero. 
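The pairing of each action with its inverse, which determines where the large negative weights fall, can be tabulated directly from the action ordering. The Python sketch below is illustrative; the names inverse_action and INVERSE_INDEX are hypothetical, and indices are zero-based, so index 0 corresponds to a_1.

```python
# Action order used for the six output units: a_1 through a_6.
ACTIONS = ["1-2", "1-3", "2-1", "2-3", "3-1", "3-2"]


def inverse_action(action):
    """Swap source and destination pegs, e.g. '1-2' becomes '2-1'."""
    src, dst = action.split("-")
    return f"{dst}-{src}"


# For each previous-action index, the index of the output unit that should be
# suppressed on the next step (zero-based positions in ACTIONS).
INVERSE_INDEX = {ACTIONS.index(a): ACTIONS.index(inverse_action(a)) for a in ACTIONS}
```

The mapping pairs 1-2 with 2-1, 1-3 with 3-1, and 2-3 with 3-2, so the six large negative weights occupy one position per previous-action component.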

There  are  other  weights  associated  with  the  previous-action  inputs  that  are 
positive-valued.  Through  these  weights,  the  generation  of  an  action  on  one  step 
results  in  a  high  probability  for  a  particular  action  on  the  next  step,  thus  forming 
two-step  sequences.  For  example,  Action  3-2  will  be  followed  by  Action  1-3.  Referring 
back  to  Fig.  26,  this  two-step  sequence  can  change  the  puzzle  from  the  third  state 
on  the  minimal  solution  path  to  the  fifth  state.  Other  sequences  exist  for  other  two- 
step  transitions  along  the  minimal  solution  path,  and  for  moving  onto  the  minimal 
solution  path. 

The  weights  on  the  delayed  action  components  are  not  sufficient  in  themselves 
for  limiting  actions  to  movement  along  the  minimal  solution  path.  The  current  state 
must  at  least  play  a  role  in  selecting  the  first  action.  Consider  the  values  of  the  input 
terms when in the start state (111). All of the delayed-output terms are 0, since
this is the first step in the trial. All other input terms are 0, except for the first
term  of  each  of  the  three  triples  encoding  the  state.  The  first  of  these  is  connected 
positively  to  Unit  2  and  negatively  to  the  rest,  except  for  Unit  6,  whose  action  is  not 
legal  for  the  start  state.  The  other  two  non-zero  input  terms  have  small  or  negative 
connections  to  units  having  legal  actions,  so  Unit  2  will  be  the  unit  to  respond  to  the 
start  state.  Unit  2  represents  Action  1-3,  the  first  action  along  the  minimal  solution 
path.  Langley’s  system  was  not  required  to  learn  the  correct  action  for  the  initial 
state,  because  both  states  (333)  and  (222)  were  goal  states — two  minimal  solution 
paths  exist,  and  both  actions  from  the  initial  state  (111)  move  along  one  of  the  paths. 
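
The input vector described above can be sketched as follows. This is an illustration under stated assumptions: one triple of binary terms per disk, six action units, and the state terms placed before the delayed-action terms. The function names and the ordering of the terms are hypothetical.

```python
def encode_state(pegs):
    """One-hot triple per disk: pegs is a tuple of peg numbers (1, 2, or 3)."""
    terms = []
    for peg in pegs:
        terms.extend(1 if peg == k else 0 for k in (1, 2, 3))
    return terms

def encode_input(pegs, prev_action=None, prev2_action=None, n_actions=6):
    """State triples followed by one-hot codes of the two previous actions.

    An action index of None (no action taken yet) yields an all-zero code.
    """
    def one_hot(index):
        return [1 if index == k else 0 for k in range(n_actions)]
    return encode_state(pegs) + one_hot(prev_action) + one_hot(prev2_action)

# Start state (111) on the first step of a trial: only the first term of
# each state triple is non-zero, and all delayed-output terms are 0.
start = encode_input((1, 1, 1))
```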

Transfer  of  Learning 

It  is  desirable  for  a  learning  system  to  be  able  to  improve  its  performance  on  a 
single  task,  called  improvement,  and  also  to  improve  performance  over  a  set  of  tasks, 
called transfer. Langley [31] lists the following four kinds of transfer between tasks:

1.  Transfer  to  more  complex  versions  of  the  task. 

2.  Transfer  to  different  initial  states  or  goal  states. 

3.  Transfer  to  tasks  of  similar  complexity  with  different  state-space  structures. 

4.  Transfer  to  tasks  of  little  similarity,  perhaps  requiring  some  of  the  same  actions 
(referred  to  as  learning  by  analogy). 

The ability of the network of this chapter to perform the first two kinds of transfer
is discussed below.

Langley  showed  that  the  heuristics  learned  by  SAGE  for  solving  the  three-disk 
Tower  of  Hanoi  puzzle  were  directly  applicable  to  the  four-disk  and  the  five-disk 
versions  of  this  puzzle,  solving  these  more  complex  puzzles  with  no  additional  search. 
The  representation  of  the  rules’  conditions  and  actions  made  this  possible:  disk,  peg, 
and  action  names  are  generalized  to  variables,  therefore  the  rules  could  be  applied 
to  the  new  task  having  an  additional  disk,  since  its  name  could  be  bound  to  a 
variable. In addition, the concept of an action’s inverse is included in the system’s
representation, enabling a situation where an action is followed by its inverse to result
in a rule discouraging the use of the inverse of any action.

The  connectionist  representation  used  here  does  not  permit  such  generalizations, 
although  some  learned  knowledge  is  transferred  to  Tower  of  Hanoi  puzzles  having 
more disks. If the input representation of the networks is augmented by adding
components to encode the position of the fourth disk, then some of the resulting
action heuristics are wrong and some are appropriately transferred. The negative weights
preventing the selection of an action that is the inverse of the previous action are
still very helpful for the four-disk puzzle. Some of the two-step sequences might also
be  applicable.  To  learn  the  four-disk  solution,  the  evaluation  function  must  also  be 
adjusted,  since  it  is  very  tailored  to  the  three-disk  version.  Therefore,  the  solution  of 
the  four-disk  puzzle  would  require  additional  learning,  although  probably  less  than 
would  be  needed  by  a  naive  system  that  has  no  experience  with  the  three-disk  puzzle. 

The  second  form  of  transfer  concerns  different  initial  and  goal  states.  Langley’s 
system was not capable of transferring to Tower of Hanoi puzzles with different initial
and goal states, but he has shown on another task how the inclusion in the system
of a representation of the goal can lead to strategies that are goal-dependent. The
action function learned by our system might generalize correctly to different initial
states, particularly those close to the minimal solution path, but this was not tested.
The evaluation function does generalize correctly to different initial states. As shown
in  Fig.  32,  the  evaluations  increase  for  states  closer  to  the  goal,  whether  or  not  the 
states  are  on  the  minimal  solution  path.  Therefore,  learning  would  be  facilitated  if 
the  initial  state  were  changed  from  its  original  position  after  the  evaluation  function 
had  been  learned. 

As  for  different  goal  states,  both  action  and  evaluation  networks  have  learned 
inappropriate  functions.  In  fact,  generalization  to  a  puzzle  with  a  different  goal  state 
would  retard  the  learning  of  a  new  solution  path.  As  Langley  suggested,  to  learn 
evaluation  and  action  functions  for  different  goals,  some  representation  of  the  goal 
must be included as input to the networks. This could be done very simply by
duplicating the terms of the current state representation and using them to encode
the goal state. Different evaluation and action functions would then be learned for
different goals, though a multilayer action network would probably be required.

Conclusion 

In  Section  5,  a  connectionist  learning  method  was  applied  to  a  task  having  a  large 
search  space,  delayed  reinforcement,  and  requiring  non-trivial  (nonlinear)  combina¬ 
tions  of  features.  In  this  section,  essentially  the  same  method  was  applied  to  a  task 
with  a  small  search  space,  requiring  non-trivial  feature  combinations,  and  for  which 
reinforcement  is  delayed  and  infrequent.  We  have  shown  how  some  of  the  credit- 
assignment  techniques  that  have  been  developed  for  learning  rules  while  doing  can 
be  incorporated  into  a  reinforcement  scheme. 

The  adaptive  network  was  able  to  learn  the  solution  to  the  three-disk  Tower  of 
Hanoi  puzzle.  The  time  (amount  of  experience)  required  to  solve  it  is  much  greater 
than that required by Langley’s [31] adaptive production system, but fewer
assumptions are incorporated into the design of the connectionist learning method. A very
limited input representation is used, consisting only of the current state and the
two previous actions. Comparison of this connectionist approach with symbolic
approaches highlights some of the limitations of connectionist representations. For
example, the connectionist system used for the Tower of Hanoi experiments is not
capable of doing variable binding in the way that Langley’s [31] production system can.
Langley’s  system  was  able  to  learn  a  single  symbolic  rule  that  uses  action  variables  to 
prohibit  actions  that  are  the  inverses  of  the  previous  actions.  Langley’s  production 
system  was  able  to  learn  such  rules  using  built-in  knowledge  of  what  “inverse”  means 
and  how  particular  actions  and  states  can  be  generalized  to  variables.  In  our  imple¬ 
mentation,  actions  are  not  generalized  to  variables;  distinct  negative  weights  from 
each  action  to  its  inverse  had  to  be  learned.  Touretzky  and  Hinton  discuss  issues  of 
this kind and present some connectionist approaches to these problems [54,53].

SECTION 7


SUMMARY  AND  CONCLUSIONS 


The  major  focus  of  the  research  reported  here  was  the  study  of  layered  networks 
for  learning  nonlinear  associative  mappings.  We  continued  our  approach  to  this 
problem  based  on  the  cooperative  interaction  of  self-interested,  goal-seeking  network 
components,  but  we  also  looked  at  other  approaches.  The  major  results  of  this 
research  are  presented  in  this  report  and  summarized  here  in  this  section.  I  also 
include  discussion  of  how  the  learning  methods  we  have  studied  relate  to  existing 
methods  and  what  avenues  appear  promising  for  future  research. 


The  Associative  Reward-Penalty  Unit 

The Associative Reward-Penalty Unit, or A_R-P Unit, is a neuron-like adaptive
unit that implements a learning rule which is a synthesis of two types of learning
methods that have usually been studied separately. Under one set of restrictions, the
A_R-P learning rule specializes to a stochastic learning automaton algorithm that has
been widely studied in the past; under a different set of restrictions, the A_R-P rule
specializes to a supervised pattern classification method that has also been widely
studied (the perceptron learning rule). Consequently, the A_R-P rule falls in the
intersection of important classes of learning methods. Although the “selective bootstrap
adaptation” rule of Widrow et al. [58] is a very close relative of the A_R-P rule, we believe
the A_R-P rule is novel. The recent pattern classification method of Thathachar and
Sastry [51] utilizes stochastic learning automaton methods but is not directed toward
solving the same kind of tasks as is the A_R-P method.
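
The rule itself is not restated in this section. The following is a minimal sketch of one A_R-P update in the form in which the rule is usually written for a unit with outputs -1 and +1. The logistic noise distribution, the parameter names `rho` and `lam`, and their values are assumptions made for illustration.

```python
import math
import random

def arp_step(w, x, reward_fn, rho=0.5, lam=0.05, rng=random):
    """One A_R-P update for a single unit with output y in {-1, +1}.

    reward_fn(y) returns 1 for reward and 0 for penalty.
    """
    s = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-s))        # P(y = +1 | w, x), assuming logistic noise
    y = 1 if rng.random() < p else -1
    expected_y = 2.0 * p - 1.0            # E[y | w, x]
    if reward_fn(y) == 1:
        step = rho * (y - expected_y)           # move toward the action taken
    else:
        step = lam * rho * (-y - expected_y)    # move toward the other action
    return [wi + step * xi for wi, xi in zip(w, x)], y
```

Setting `lam` to 0 removes the penalty update, so the unit then learns only from rewarded actions.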

In Section 2 I discussed what is to be gained by the kind of synthesis represented
by the A_R-P rule. Units employing this rule do not need explicit instructional
information in order to learn associative mappings and, in fact, are able to learn under
extreme  uncertainty.  This  ability  has  implications  for  enriching  the  type  of  game 
and  team  problems  that  can  be  studied  (as  illustrated  by  our  layered-network  simu¬ 
lations  of  Section  3)  and  for  applications  to  control  problems  (as  illustrated  by  our 
pole-balancing  and  Tower  of  Hanoi  examples  of  Sections  5  and  6). 

Work that remains to be done regarding the theory of single A_R-P units concerns
their behavior in problems in which the input vectors are not linearly independent.
The A_R-P convergence theorem applies only to the case of linearly independent input
vectors, but the utility of the A_R-P rule is not restricted to this case, and it is
likely that the convergence results can be extended. The asymptotic behavior of
a single A_R-P unit needs to be examined in cases in which the input vectors are
linearly separable and not linearly separable. Preliminary simulations suggest that
A_R-P units maximize reward probability in these cases, which is not what the usual
methods do. This behavior could have interesting implications that we have not
yet explored. We placed higher priority on studying the cooperative behavior of
interconnected A_R-P units. Although this research direction makes it more difficult
to  obtain  mathematical  results,  we  pursued  it  because  of  our  basic  interest  in  studying 
collective  behavior. 


Cooperative Behavior of A_R-P Units

In the same way that the A_R-P learning rule can be viewed either in terms of
adaptive pattern classification or in terms of stochastic learning automata, networks
of A_R-P units can be viewed either in terms of connectionist adaptive networks or
in terms of the collective behavior of stochastic learning automata. In Section 3,
I discussed layered networks of A_R-P units from both perspectives. The examples
described in that section show how these networks can learn to solve nonlinear
discrimination problems. Networks of A_R-P units, or layered teams of A_R-P units, are
therefore  examples  of  systems  that  can  adaptively  develop  representations  in  order 
to  form  nonlinear  associative  mappings. 


The method for doing this that is most closely related to A_R-P networks is the
error back-propagation method of Rumelhart, Hinton, and Williams [44], which was
presented at about the same time that we first published results of A_R-P network
simulations. Since then, Williams [61,62] has shown that there is a strong relationship
between these methods: they are both gradient-following procedures. Whereas
gradient information is directly computed via the backward pass in the back-propagation
method, in A_R-P networks it is estimated via the sampling procedure realized by the
stochastic units.
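
The sense in which a stochastic unit estimates a gradient can be made concrete with a short derivation. It is a sketch for a single unit in the simplified case with no penalty term, and the notation is introduced here for illustration rather than taken from the report: the output is y in {0, 1}, p = P(y = 1 | w, x) is the logistic function of the weighted sum w·x, and the reinforcement r depends on w only through y.

```latex
\begin{align*}
\frac{\partial}{\partial w}\,\ln P(y \mid w, x) &= (y - p)\,x, \\
\nabla_w E[\,r \mid w, x\,]
  &= \sum_{y} E[\,r \mid y, x\,]\;\nabla_w P(y \mid w, x)
   = E\big[\, r\,(y - p)\,x \;\big|\; w, x \,\big].
\end{align*}
```

Under these assumptions the sampled quantity r (y - p) x is an unbiased but noisy estimate of the gradient of expected reinforcement, whereas a backward pass computes the corresponding gradient without sampling noise.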

It is therefore not surprising that the back-propagation method is faster than
the A_R-P method (confirmed by our comparative studies summarized next). So
what advantages might the A_R-P method have over back-propagation? First, the
A_R-P method provides a link to a wide body of literature (the learning automata
literature) that has not yet been explored by connectionists. I think that a number of
interesting consequences may arise from this connection. Second, the A_R-P method
does not require the complex back-propagation computation. Consequently, it
may have some advantages for implementation by parallel hardware and might be
more plausible than back-propagation from a biological perspective. Third, the
A_R-P method might be extensible to the case of recurrent networks with asymmetric
connection matrices in ways that back-propagation is not (some recent results by
Williams [61,62] are relevant in this regard). Additional research is needed to explore
these  possibilities. 

One  of  the  most  important  questions  regarding  network  learning  methods  is  how 
well  they  scale  up  to  larger,  more  difficult  problems.  The  research  covered  by  this 
report does not address this question. What we have learned about the A_R-P network
method, however, suggests that a straightforward scaling up of the method will not
be effective. By a straightforward scaling up of the method I mean that the single
reinforcement signal is just broadcast to a larger number of A_R-P units. As in the
case of the error back-propagation method, as networks get larger, the number of
possible solutions can increase so that learning can occur faster for bigger networks.
However, in the A_R-P method, as networks get larger, the amount of noise that
contaminates the gradient estimates increases, a problem that is not present in the
back-propagation method. Thus, while both methods probably scale poorly, the
straightforward scaling up of the A_R-P method is likely to be worse.

One can, however, consider ways of scaling up the A_R-P method that are more
interesting  than  just  adding  more  units  and  uniformly  broadcasting  a  single  reinforce¬ 
ment  signal  to  all  of  them.  It  seems  clear  that  the  only  way  to  move  toward  large, 
complex  learning  tasks  is  to  use  modular  or  hierarchical  networks  with  local  forms 
of  reinforcement.  I  envision  networks  in  which  superordinate  modules  learn  how  to 
provide  different  levels  of  reinforcement  to  different  subordinate  modules.  This  ap¬ 
proach  will  involve  game  decision  problems  in  addition  to  the  team  decision  problems 
discussed  in  Section  3.  Consequently,  the  ability  of  units  employing  stochastic  learn¬ 
ing  automaton  principles  to  learn  in  game  situations  may  be  an  important  factor  in 
implementing  these  more  sophisticated  forms  of  structural  credit  assignment.  This 
is  an  important  topic  for  future  research. 

Comparison  of  Methods  for  Learning  by  Layered  Networks 

Eleven  hidden-unit  learning  methods  were  compared  by  applying  them  to  the  task 
of  learning  a  multiplexer  function.  The  methods  were  tested  in  the  hidden  units  of 
a  two-layer  network.  Two  kinds  of  performance  measures  were  used:  the  number  of 
errors  accumulated  throughout  a  training  run  and  the  total  number  of  input  vectors 
for  which  the  final  weight  values  of  a  run  result  in  an  incorrect  output.  Care  was 
taken  to  try  different  parameter  values  for  each  method  and  to  present  performance 
measures  as  averages  and  confidence  intervals  over  repetitive  training  runs. 
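
The task and the second performance measure can be sketched as follows. The size of the multiplexer is not stated in this summary, so the version below (two address lines selecting one of four data lines, with the address bits first) is an assumption, and the function names are hypothetical.

```python
from itertools import product

def multiplexer(bits, n_address=2):
    """Return the data bit selected by the address bits (address bits come first)."""
    address = int("".join(str(b) for b in bits[:n_address]), 2)
    return bits[n_address + address]

def final_errors(predict, n_address=2):
    """Count input vectors on which predict disagrees with the multiplexer."""
    n_inputs = n_address + 2 ** n_address
    return sum(predict(bits) != multiplexer(bits, n_address)
               for bits in product((0, 1), repeat=n_inputs))

# A predictor that always answers 0 is wrong on exactly half of the 64 inputs.
always_zero_errors = final_errors(lambda bits: 0)
```

The first measure, errors accumulated during training, would be counted in the same way but over the sequence of training presentations rather than over all input vectors at the end of a run.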

The  learning  method  with  the  best  performance  of  the  algorithms  compared  was 
the error back-propagation algorithm of Rumelhart, Hinton, and Williams [44]. The
next best performing methods were the reinforcement-learning methods based on the
A_R-P rule. Best among these methods was a modification of the A_R-P rule designed
to combine reinforcement learning with a method to create features to represent input
patterns present when the network is receiving low reinforcement. A less successful
modification of the A_R-P method is based on the idea of providing each hidden unit
with a more informative evaluation signal than is provided by a reinforcement signal
broadcast to all the units. In this method, reinforcement values are back-propagated
based on the weight with which the hidden units are connected to the output unit
(unlike the back-propagation method of Rumelhart et al., reinforcement is
back-propagated instead of error). This modification produces faster error reduction early
in learning runs, but later in runs the rate of error reduction slows and is surpassed
by that of the unmodified A_R-P method.

Some  more  conventional  optimization  techniques  were  also  applied  to  the  problem 
of  finding  weight  values  for  the  multiplexer  task.  These  methods  perform  a  direct 
search in the space of all possible value assignments to the weights of the hidden units.
They do not use any knowledge of the network’s structure. Such a large search space
and the ignorance of the network’s structure result in very poor learning performance
compared  to  the  other  methods  tested.  We  included  these  methods  primarily  to  serve 
as  control  simulations.  One  of  these  methods,  for  example,  is  probably  the  simplest 
possible  search  technique.  To  be  of  any  interest  at  all,  a  method  must  perform  better 
than  this  method. 

We  also  experimented  with  the  idea  of  improving  the  accuracy  of  the  gradient 
estimate produced by the A_R-P method by letting each hidden unit try several
actions while each of the training vectors is present. We call this method the batched
A_R-P method. The results of these simulations show that it is possible to obtain
increasingly accurate gradient estimates without requiring a complex error
propagation process. Letting the hidden units obtain 10 samples for each training vector
presentation, we obtained learning in the multiplexer task several times faster than
the unbatched method (1 sample per presentation) in terms of the number of
presentations of each training vector. Of course, the amount of processing required for each
presentation  is  greater  than  in  the  unbatched  method,  and  the  actual  time  required 
will  depend  critically  on  how  the  system  is  implemented.  If  in  some  learning  domain 
it  is  costly  to  obtain  stimulus  vectors,  but  it  is  not  costly  to  update  the  network  and 
obtain  evaluations,  then  it  might  be  practical  to  use  the  batched  method  to  increase 
the  speed  of  learning. 
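
The batching idea can be sketched for a single stochastic unit. This is a simplified reward-only illustration, not the batched method as implemented in the simulations: the unit has output y in {0, 1}, `evaluate` is a hypothetical function returning the reinforcement for a sampled action, and the step size and sample count are arbitrary.

```python
import math
import random

def batched_update(w, x, evaluate, n_samples=10, rho=0.5, rng=random):
    """Average n_samples sampled updates r * (y - p) * x for one input vector.

    evaluate(y) returns a reinforcement in [0, 1] for the sampled action y in {0, 1}.
    """
    s = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-s))       # P(y = 1 | w, x), assuming logistic noise
    total = 0.0
    for _ in range(n_samples):
        y = 1 if rng.random() < p else 0
        total += evaluate(y) * (y - p)   # one noisy gradient sample
    step = rho * total / n_samples       # averaging reduces the sampling noise
    return [wi + step * xi for wi, xi in zip(w, x)]
```

Each presentation now costs `n_samples` evaluations, which is the trade-off described above between the number of presentations and the processing per presentation.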

As  in  all  empirical  studies,  it  is  important  to  stress  that  the  results  presented  in 
Section 4 are valid only for the particular task and training regime that were used
during the experiments. For example, a task requiring a smaller network might be
most  readily  solved  by  a  random  search  of  the  entire  weight  space.  In  fact,  in  selecting 
a  task  for  the  comparative  study,  a  small  task  with  two  input  components  was  tested 
and  it  was  discovered  that  a  random  search  solved  the  task  faster  than  the  error  back- 
propagation  and  reinforcement-learning  algorithms.  The  multiplexer  task  was  chosen 
because  the  weight  space  is  sufficiently  large  that  direct  optimization  methods  are 
slow but not so large that a prohibitive amount of simulation time would be needed
to  gather  significant  performance  statistics.  Time  also  prohibited  the  extension  of 
the  comparative  study  to  other,  more  complex  tasks  as  would  be  required  to  address 
issues  regarding  how  well  the  algorithms  scale  up  to  harder  tasks. 

Strategy  Learning  with  Multilayer  Networks 

Strategy  learning  can  be  characterized  as  the  acquisition  of  a  method  for  gen¬ 
erating  actions  that  cause  desired  transitions  among  the  states  of  a  problem.  The 
desirability  of  particular  transitions  is  often  indicated  by  an  evaluation  that  imposes 
a  preference  ordering  on  the  possible  transitions  from  a  given  state.  In  previous 
research,  we  have  shown  that  reinforcement-learning  methods  can  be  used  to  learn 
to  select  the  best  action  under  these  conditions,  whereas  most  connectionist  learning 
methods  require  knowledge  of  the  correct  action. 

For  some  tasks,  an  evaluation  is  not  immediately  available  but  occurs  only  after 
a sequence of actions has been generated. Sutton [47,46] has developed the AHC
learning rule for dealing with this temporal credit-assignment problem. Hampson [22]
has developed a similar method. Barto, Sutton, and Anderson [13,47,15] combined
the AHC rule with a reinforcement-learning method into a single-layer network for
strategy  learning.  In  the  research  reported  here,  we  extended  these  learning  methods 
for  single-layer  networks  to  methods  for  learning  strategies  with  multilayer  networks. 
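
The kind of update performed by an evaluation unit can be sketched as follows. This is a minimal illustration assuming a linear evaluation of the input, no eligibility traces, and hypothetical parameter names `beta` and `gamma`; it is not a restatement of the exact AHC rule used in the experiments.

```python
def ahc_update(v, x_prev, x_curr, r, terminal, beta=0.1, gamma=0.9):
    """Adjust evaluation weights v from one state transition.

    The internal reinforcement r_hat compares the previous prediction with the
    external reinforcement r plus the discounted current prediction.
    """
    def predict(x):
        return sum(vi * xi for vi, xi in zip(v, x))

    target = r if terminal else r + gamma * predict(x_curr)
    r_hat = target - predict(x_prev)
    new_v = [vi + beta * r_hat * xi for vi, xi in zip(v, x_prev)]
    return new_v, r_hat
```

Because `r_hat` is available on every step, it can serve as the evaluation signal for action selection even when the external reinforcement arrives only at the end of a sequence.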

We  chose  the  error  back-propagation  method  to  update  the  weights  of  the  hidden 
units  in  networks  because  it  was  shown  to  be  fastest  by  our  comparative  simulations. 
Consequently,  the  strategy-learning  networks  we  studied  consisted  of  the  following. 
For each task, there is an evaluation network and an action network. The
evaluation network consists of an AHC unit as output unit and a layer of hidden units.
The  AHC’s  error  term  is  back-propagated  to  the  hidden  units  in  exactly  the  same 
manner  that  the  error  term  of  the  delta  rule  is  back-propagated  in  the  Rumelhart 
et  al.  method.  The  action  network  consists  of  an  associative  reinforcement  learning 
unit  as  output  unit  and  a  layer  of  hidden  units.  This  unit’s  “error  term”  is  back- 
propagated  to  the  hidden  units.  This  hybrid  system  was  applied  to  the  pole-balancing 
and  Tower  of  Hanoi  tasks. 

Pole  Balancing 

The major difficulty in the pole-balancing task is due to the use of a very
uninformative evaluation signal. Task-specific information, such as the dynamics of the pole,
was assumed to be unavailable to the design of the learning system. The evaluation
signal  was  supplied  only  when  the  pole  fell  or  the  cart  hit  the  end  of  the  track.  Other 
information  concerning  the  task  objective,  such  as  the  advantage  of  maintaining  the 
pole  near  the  vertical,  was  not  assumed.  Of  course,  if  such  information  is  available 
it  should  be  incorporated  into  the  initial  design  of  the  learning  system  or  used  to 
provide a more informative evaluation or error signal. Our interests, however, are in
developing  learning  methods  for  those  parts  of  a  task  for  which  a  minimum  amount 
of  information  is  available.  For  example,  we  were  not  attempting  to  design  a  learn¬ 
ing  control  method  for  the  pole-balancing  task  per  se — much  more  information  is 
available  for  this  task  than  we  were  willing  to  use. 

The  combination  of  the  error  back-propagation  method  with  the  AHC  and 
reinforcement-learning  methods  was  successful:  the  two-layer  system  balanced  the 
pole  for  many  more  steps  than  did  a  one-layer  system  receiving  the  same  represen¬ 
tation  of  the  pole’s  state.  The  hidden  units  learned  features  with  which  the  output 
units  could  overcome  the  limitations  imposed  by  the  representation  of  the  pole’s  state 
and  the  linearity  of  the  output  functions.  In  analyzing  the  new  features  that  were 
formed,  it  was  discovered  that  only  a  small  number  of  new  features  were  needed  to 
solve  the  task.  Some  runs  resulted  in  the  formation  of  a  single  new  feature,  while 
others  resulted  in  up  to  three  features  that  developed  significant  influence  on  the 


Our  previous  experiments  [13,47,15]  with  the  pole-balancing  task  involved  a 
single-layer  network  for  which  the  continuous  state  space  was  discretized  into  162 
distinct,  4-dimensional  rectangles  to  allow  the  system’s  units  to  learn  appropriate 
functions.  The  networks  whose  study  is  reported  here  differ  in  the  absence  of  this 
“decoder”  and  the  addition  of  hidden  units  that  learn  features  that  decode  the  state 
into  an  appropriate  form.  Another  difference,  which  makes  performance  comparisons 
with  our  earlier  pole-balancing  studies  difficult,  is  that  after  every  failure  the  state  of 
the  pole  is  set  to  a  random  state  instead  of  the  zero  state  (vertical,  stationary  pole), 
as  was  done  in  the  previous  experiments.  For  this  reason,  many  more  failures  were 
generated  in  the  current  paradigm,  because  some  reset  states  were  very  near  failure 
states.  After  the  same  number  of  training  steps,  the  current  system  had  not  attained 
as  high  an  average  balancing  time  as  had  the  previous  system.  This  is  due  to  either 
a)  the  additional  experience  needed  to  learn  useful  new  features,  or  b)  the  lack  of 
experience  in  critical  states  (such  as  the  zero  state)  for  which  nearby  states  require 
opposite  actions. 
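
The "decoder" used with the single-layer network can be sketched as a lookup from the continuous state to one of 162 rectangles. The text here gives only the count, so the threshold values below are illustrative assumptions chosen to yield 3 x 3 x 6 x 3 = 162 regions, and the function names are hypothetical.

```python
from bisect import bisect_right

# Illustrative interior thresholds (assumed): 3 x 3 x 6 x 3 = 162 boxes.
X_CUTS = [-0.8, 0.8]                        # cart position (m)
X_DOT_CUTS = [-0.5, 0.5]                    # cart velocity (m/s)
THETA_CUTS = [-6.0, -1.0, 0.0, 1.0, 6.0]    # pole angle (degrees)
THETA_DOT_CUTS = [-50.0, 50.0]              # pole angular velocity (degrees/s)

def box_index(x, x_dot, theta, theta_dot):
    """Map a continuous state to the index (0..161) of its rectangle."""
    i = bisect_right(X_CUTS, x)
    j = bisect_right(X_DOT_CUTS, x_dot)
    k = bisect_right(THETA_CUTS, theta)
    m = bisect_right(THETA_DOT_CUTS, theta_dot)
    return ((i * 3 + j) * 6 + k) * 3 + m

def decode(x, x_dot, theta, theta_dot):
    """One-hot vector of length 162: the input a single-layer network would see."""
    vec = [0] * 162
    vec[box_index(x, x_dot, theta, theta_dot)] = 1
    return vec
```

The networks studied here receive the four state variables directly instead, and the hidden units must learn whatever partition of the state space the task requires.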

Tower  of  Hanoi 

The  learning  methods  used  in  the  pole-balancing  network  were  applied  with  few 
modifications  to  the  Tower  of  Hanoi  puzzle.  Similar  restrictions  on  the  amount  of 
a  priori  knowledge  were  assumed.  A  final  reinforcement  at  the  end  of  a  successful 
sequence  of  actions  (as  opposed  to  an  unsuccessful  sequence  for  the  pole-balancing 
task)  provided  information  regarding  the  objective  of  the  task.  The  state  of  the 
puzzle  was  presented  to  the  network  as  a  binary  vector  representing  the  peg  on  which 
each  disk  resides.  The  two-layer  network  again  performed  better  than  a  single-layer 
network.  The  two-layer  network  reliably  found  a  minimum-length  solution,  i.e.,  the 
network  applied  a  sequence  of  actions  consisting  of  the  minimum  number  of  actions 
required  to  achieve  the  goal  state.  In  solving  the  puzzle,  an  evaluation  function  was 
learned  that  ranked  states  according  to  the  smallest  number  of  moves  between  the 
state  and  the  goal  state. 
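
The quantity that such an evaluation function ranks by, the smallest number of moves from each state to the goal, can be computed directly by breadth-first search over the state-transition graph. The sketch below assumes that a state lists the peg of each disk with the smallest disk first and that the goal is (333); the function names are hypothetical.

```python
from collections import deque

def legal_moves(state):
    """Yield successor states; state[d] is the peg (1-3) of disk d, smallest disk first."""
    for d, peg in enumerate(state):
        if any(state[s] == peg for s in range(d)):
            continue                      # a smaller disk sits on top of disk d
        for target in (1, 2, 3):
            if target != peg and all(state[s] != target for s in range(d)):
                yield state[:d] + (target,) + state[d + 1:]

def moves_to_goal(goal=(3, 3, 3)):
    """Breadth-first search from the goal: fewest moves from every state to the goal."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        state = queue.popleft()
        for nxt in legal_moves(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist

dist = moves_to_goal()
```

Searching backward from the goal is valid because every move is reversible. A learned evaluation that increases as this distance decreases orders the states in the way described above.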

In  learning  the  evaluation  function,  a  number  of  new  features  were  developed  by 
the network. In the state-transition graph for the Tower of Hanoi puzzle there are
several bottlenecks, where parts of the graph are interconnected by a single path. New
features  were  formed  that  discriminate  states  in  the  bottlenecks  from  other  states. 
The  output  unit  of  the  evaluation  network  could  not  learn  a  monotonically-increasing 
function  through  the  bottlenecks  with  the  original  features,  but  the  new  features 
resulted  in  a  good  evaluation  function. 

Langley [31] developed an adaptive production system that learned to solve the
Tower  of  Hanoi  puzzle.  The  connectionist  system  that  we  applied  to  the  Tower  of 
Hanoi  puzzle  has  few  similarities  to  Langley’s  production  system.  It  is  instructive  to 
analyze  the  differences  and  to  question  whether  or  not  they  represent  fundamental 
distinctions  between  symbolic  and  connectionist  approaches.  One  difference  is  that 
Langley  uses  a  full  history  of  past  states  and  actions  to  aid  the  assignment  of  credit, 
whereas  the  connectionist  system  relies  on  the  learning  of  a  good  evaluation  function 
to  solve  the  credit  assignment  problem.  This  difference  is  not  fundamental  to  the 
representations  involved;  evaluation  functions  can  be  used  for  symbolic  systems  and 
a  history  of  states  and  actions  can  be  of  use  in  training  a  connectionist  system.  A 
history  could  be  used  much  as  it  is  for  the  symbolic  system,  by  retrieving  the  events 
as  training  instances.  A  separate  issue  is  the  association  by  a  connectionist  system 
of  past  events  with  current  action  probabilities  in  order  to  base  decisions  on  previous 
states  and  actions.  In  our  experiments,  the  connectionist  system  does  receive  the 
two previous actions as input, so two- and three-step sequences can be learned; the
inclusion  of  all  past  events  as  input  to  the  system  is  not  feasible.  An  alternative  is 
to  collapse  the  history  into  a  weighted  average  of  past  events. 
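
One way to collapse a history into a weighted average is an exponentially decaying trace, sketched below. The decay value and the function name are assumptions made for illustration; this was not part of the experiments reported here.

```python
def update_trace(trace, event, decay=0.8):
    """Exponentially weighted average of past event vectors (most recent weighted most)."""
    return [decay * t + (1.0 - decay) * e for t, e in zip(trace, event)]

# Three one-hot events presented in order; older events fade but remain visible.
trace = [0.0, 0.0, 0.0]
for event in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    trace = update_trace(trace, event)
```

The trace has a fixed length regardless of how many steps have elapsed, so it could be supplied as network input in place of an explicit list of past events.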

Another  difference  is  that  a  breadth-first  search  is  not  performed  by  the  connec¬ 
tionist  system.  In  its  initial,  naive  state,  the  connectionist  system  chooses  actions 
randomly  and  as  the  evaluation  function  develops,  the  action  probabilities  become 
increasingly  biased  towards  actions  that  result  in  state  transitions  producing  positive 
changes  in  the  evaluation  function.  Breadth-first  search  control  can  be  added  to  the 
connectionist framework by disregarding the probabilistic generation of actions and
presenting state-action pairs as training instances after some process has assigned
credit to every pair. Learning an evaluation function in this case requires the
extraction of desirable paths from a state history. One attraction of the connectionist
approach demonstrated here is its ability to learn with minimal resources for search
control  and  history  maintenance. 
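
The way action probabilities drift from uniform toward actions that improve the evaluation can be sketched as follows. This is a deliberately simplified stand-in, not the network or learning rule used in our experiments: the softmax over per-action preferences, the `rate` parameter, and the function names are assumptions, and the evaluation values are supplied by hand rather than learned.

```python
import math
import random


def action_probabilities(prefs):
    """Softmax over action preferences; equal preferences give a uniform choice."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(prefs, rng):
    """Sample an action index according to the current probabilities."""
    probs = action_probabilities(prefs)
    return rng.choices(range(len(prefs)), weights=probs, k=1)[0]


def reinforce(prefs, action, value_before, value_after, rate=0.5):
    """Shift the chosen action's preference by the change in the evaluation."""
    updated = list(prefs)
    updated[action] += rate * (value_after - value_before)
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    prefs = [0.0, 0.0]                      # naive state: both actions equally likely
    action = choose_action(prefs, rng)
    # Assume, for illustration, that the transition raised the evaluation.
    prefs = reinforce(prefs, action, value_before=0.2, value_after=0.6)
    print(action, action_probabilities(prefs))
```

No search tree and no stored history are needed here; the only state carried between steps is the preference vector and the two evaluation values.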

A  very  important  distinction  that  is  currently  a  topic  of  debate  is  the  use  of 
variables.  A  single  symbolic  rule  can  be  applied  to  many  situations  through  the 
binding  of  variables.  For  example,  Langley’s  system  learns  a  rule  that,  through 
variable  binding,  can  be  used  to  avoid  the  application  of  the  previous  action’s  inverse 
for  all  possible  actions.  With  one  training  instance  and  knowledge  of  what  an  action’s 
“inverse”  means,  a  single  rule  is  learned  that  generalizes  to  all  other  actions.  It  is 
not  clear  how  knowledge  of  an  action’s  inverse  can  be  used  in  a  connectionist  system 
to  either  a)  learn  the  connectionist  analog  of  a  generalized  rule  with  variables,  or 
b)  duplicate  the  weight  changes  due  to  experience  with  one  action  to  the  weights  of 
other  actions.  Touretzky  and  Hinton  [54]  have  shown  how  variable  binding  can  be 
performed  in  a  particular  connectionist  system. 
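
To make the role of variables concrete, the sketch below states the "do not undo the previous action" rule once, over a variable move, so that it applies to every action without further training. It is an illustration only, not Langley's production system: the `Move` type, the peg names, and the function names are assumptions.

```python
from typing import List, NamedTuple, Optional


class Move(NamedTuple):
    """A move of the top disk from one peg to another (hypothetical encoding)."""
    source: str
    target: str


def inverse(move: Move) -> Move:
    """Knowledge of what an action's inverse means: swap source and target."""
    return Move(move.target, move.source)


def undoes_previous(candidate: Move, previous: Optional[Move]) -> bool:
    """One rule with a variable: reject any candidate that inverts the previous move."""
    return previous is not None and candidate == inverse(previous)


def allowed_moves(candidates: List[Move], previous: Optional[Move]) -> List[Move]:
    """Filter candidate moves through the single generalized rule."""
    return [m for m in candidates if not undoes_previous(m, previous)]


if __name__ == "__main__":
    candidates = [Move("A", "B"), Move("B", "A"), Move("A", "C")]
    print(allowed_moves(candidates, previous=Move("A", "B")))
```

Because the rule refers to the previous move only through a variable, it covers all peg pairs at once; a network with separate weights per action has no equally direct way to share what it learned about one action with the others.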

Related  to  the  issue  of  variables  is  the  issue  of  the  transfer  of  learning.  After 
learning  strategies  for  solving  one  task,  an  efficient  learning  system  must  be  able  to 
exploit  common  aspects  between  this  task  and  subsequent  tasks  by  applying  in  similar 
situations  the  strategies  that  worked  well  for  the  first  task.  Langley’s  production  rule 
having  a  variable  action  and  state  can  be  immediately  applied  to  other,  more  complex 
Tower  of  Hanoi  puzzles.  This  is  not  possible  for  the  connectionist  system  and  the 
state  and  action  representations  used  here.  Learning  is  transferred  but  not  to  the 
degree  possible  with  variablized  rules.  The  strategies  learned  from  the  3-disk  Tower 
of  Hanoi  are  specifically  dependent  on  the  3  disks — the  addition  of  another  disk  does 
not affect the strategies until further learning occurs. Different representations of
states and actions would result in different amounts of transfer.

Further  Developments  of  Strategy  Learning  Networks 

Our  strategy  learning  networks  can  be  viewed  in  the  context  of  two  theoretical 
traditions,  and  future  research  can  take  two  directions  depending  on  which  tradition 
is  followed.  One  tradition  is  that  of  control  theory — adaptive  and  learning  control. 
The  learning  methods  employed  by  our  networks  are  related  to  some  discussed  in  the 
past by control theorists, e.g., Refs. [20,34,57,60]. These methods differ substantially
from  the  more  orthodox  adaptive  control  techniques  that  involve  the  identification 
of  unknown  plant  parameters  in  that  they  can  be  applied  with  fewer  assumptions 
about  the  structure  of  the  plant  to  be  controlled.  However,  these  methods  do  not 
lend  themselves  to  rigorous  convergence  results  and  so  have  not  been  actively  pursued 
by  modern  control  theorists. 

I think that the results reported here contribute several new methods
to this “unorthodox” approach to learning control. One contribution
is  the  use  of  layered  networks  to  learn  nonlinear  control  rules.  Second,  the  AHC 
method  developed  by  Sutton  for  dealing  with  temporal  credit-assignment  may  be 
a significant novel method. Finally, the A_{R-P} learning rule is applicable to these
types  of  control  tasks  (although  the  strategy  learning  networks  discussed  in  this 
report  do  not  use  it).  In  order  to  continue  the  development  of  these  methods  within 
this  framework  of  learning  control  it  is  necessary  to  develop  the  theory  as  much  as 
possible.  Although  I  do  not  think  one  will  be  able  to  prove  broad  convergence  results 
for  these  types  of  methods  applied  to  nonlinear  control  problems,  I  think  that  the 
methods  need  to  be  developed  to  the  point  where  they  can  be  applied  more  routinely. 
In  order  to  accomplish  this,  these  methods  need  to  be  applied  to  control  tasks  that  are 
simpler  than  the  pole-balancing  task  studied  here  so  that  network  design  decisions 
can  be  made  with  the  aid  of  relevant  theory  and  results  can  be  compared  with  those 
obtainable  by  more  conventional  methods.  Of  course,  the  eventual  goal  is  to  develop 
learning  control  methods  for  problems  to  which  the  conventional  methods  are  not 
applicable. 

The  other  tradition  to  which  our  work  can  be  related  is  the  symbolic  artificial 
intelligence  tradition  illustrated  by  the  adaptive  production  system  of  Langley  to 
which  we  compared  the  Tower  of  Hanoi  network  in  Section  6.  In  order  to  make 
closer contact with this tradition it is necessary to develop more sophisticated
representational schemes that facilitate the kinds of functions accomplished by variables
and  variable  binding.  It  also  seems  necessary  to  develop  a  means  for  networks  to 
perform  something  like  multistep  reasoning  processes.  Efforts  in  these  directions  are 
being made by some connectionist researchers (e.g., [54,53]), but I know of no “natural”
connectionist way of doing these things. I think that a first step toward these
types  of  capabilities  is  to  develop  means  for  networks  to  adaptively  form  internal 
models  of  their  environments  which  they  can  manipulate  for  a  variety  of  purposes. 
We and others have taken a few steps in this direction (e.g., [49,50,43,25]), but much
progress remains to be made. This direction of research is also relevant for potential
applications of networks to the types of engineering control problems discussed
above. 


Bibliography 


[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive Science, 9:147-169, 1985.

[2] S. Amarel. Problems of Representation in Heuristic Problem Solving: Related
Issues  in  the  Development  of  Expert  Systems.  Technical  Report  CBM-TR-118, 
Laboratory  for  Computer  Science,  Rutgers  University,  New  Brunswick,  NJ, 
1981. 

[3] C. W. Anderson. Feature Generation and Selection by a Layered Network of
Reinforcement Learning Elements: Some Initial Experiments. Technical Report
82-12, Department of Computer and Information Science, University of
Massachusetts,  Amherst,  MA,  1982. 

[4]  C.  W.  Anderson.  Learning  and  Problem  Solving  with  Multilayer  Connectionist 
Systems.  PhD  thesis,  University  of  Massachusetts,  Amherst,  MA,  1986. 

[5] Y. Anzai and H. A. Simon. The theory of learning by doing. Psychological
Review, 86, 1979.

[6] A. G. Barto. Learning by statistical cooperation of self-interested neuron-like
computing elements. Human Neurobiology, 4:229-256, 1985.

[7]  A.  G.  Barto  and  P.  Anandan.  Pattern  recognizing  stochastic  learning  automata. 
IEEE Transactions on Systems, Man, and Cybernetics, 15:360-375, 1985.

[8] A. G. Barto, P. Anandan, and C. W. Anderson. Cooperativity in networks of
pattern recognizing stochastic learning automata. In Proceedings of the Fourth
Yale Workshop on Applications of Adaptive Systems Theory, New Haven, CT,
May  1985.  An  extended  version  of  this  paper  appeared  in  Adaptive  and  Learning 
Systems,  K.  S.  Narendra  (ed.),  Plenum,  1986. 


[9] A. G. Barto and C. W. Anderson. Structural learning in connectionist systems.
In Proceedings of the Seventh Annual Conference of the Cognitive Science
Society, Irvine, CA, August 1985.

[10] A. G. Barto, C. W. Anderson, and R. S. Sutton. Synthesis of nonlinear control
surfaces by a layered associative search network. Biological Cybernetics,
43:175-185, 1982.

[11] A. G. Barto and R. S. Sutton. Goal Seeking Components for Adaptive Intelligence:
An Initial Assessment. Technical Report AFWAL-TR-81-1070, Air Force
Wright Aeronautical Laboratories/Avionics Laboratory, Wright-Patterson AFB,
OH,  1981. 

[12] A. G. Barto and R. S. Sutton. Landmark learning: An illustration of associative
search.  Biological  Cybernetics,  42:1-8,  1981. 

[13] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike elements that can
solve  difficult  learning  control  problems.  IEEE  Transactions  on  Systems,  Man, 
and  Cybernetics,  13:835-846,  1983. 

[14] A. G. Barto, R. S. Sutton, and P. S. Brouwer. Associative search network:
A reinforcement learning associative memory. Biological Cybernetics,
40:201-211, 1981.

[15] A. G. Barto, editor. Simulation Experiments with Goal-Seeking Adaptive Elements.
Technical Report AFWAL-TR-84-1022, Air Force Wright Aeronautical
Laboratories/Avionics Laboratory, Wright-Patterson AFB, OH, 1984.

[16] R. R. Bush and F. Mosteller. Stochastic Models for Learning. Wiley, New York,
1955.

[17] R. H. Cannon, Jr. Dynamics of Physical Systems. McGraw-Hill, Inc., 1967.

[18]  R.  O.  Duda  and  P.  E.  Hart.  Pattern  Classification  and  Scene  Analysis.  Wiley, 


[19] W. K. Estes. Toward a statistical theory of learning. Psychological Review,
57:94-107,  1950. 

[20] K. S. Fu. Learning control systems--Review and outlook. IEEE Transactions
on Automatic Control, 210-221, 1970.

[21] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic
Press,  New  York,  1981. 

[22] S. Hampson. A Neural Model of Adaptive Behavior. PhD thesis, University of
California,  Irvine,  CA,  1983. 

[23] G. E. Hinton and J. A. Anderson, editors. Parallel Models of Associative Memory.
Erlbaum, Hillsdale, NJ, 1981.

[24] G. E. Hinton and T. J. Sejnowski. Analyzing cooperative computation. In
Proceedings of the Fifth Annual Conference of the Cognitive Science Society,
Rochester,  NY,  1983. 

[25]  M.  I.  Jordan.  Personal  communication. 

[26]  A.  H.  Klopf  and  E.  Gose.  An  evolutionary  pattern  recognition  network.  IEEE 
Transactions  on  Systems,  Man,  and  Cybernetics,  15:247-250,  1969. 

[27] S. Lakshmivarahan. ε-Optimal Learning Algorithms--Non-absorbing Barrier
Type. Technical Report EECS 7901, School of Electrical Engineering and Computer
Science, University of Oklahoma, Norman, OK, 1979.

[28] S. Lakshmivarahan. Learning Algorithms and Applications. Springer-Verlag,
New  York,  1981. 

[29] S. Lakshmivarahan and K. S. Narendra. Learning algorithms for two-person
zero-sum stochastic games with incomplete information. Mathematics of Operations
Research, 6:379-386, 1981.

[30] S. Lakshmivarahan and K. S. Narendra. Learning algorithms for two-person
zero-sum stochastic games with incomplete information: A unified approach.
SIAM Journal of Control and Optimization, 20:541-552, 1982.




[31] P. Langley. Learning to search: From weak methods to domain-specific heuristics.
Cognitive Science, 9:217-260, 1985.

[32]  G.  F.  Luger.  The  use  of  the  state  space  to  record  the  behavioral  effects  of 
subproblems and symmetries in the Tower of Hanoi problem. Journal of
Man-Machine Studies, 8:421-441, 1976.

[33]  N.  J.  Mackintosh.  Conditioning  and  Associative  Learning.  Oxford  University 
Press,  New  York,  1983. 

[34]  J.  M.  Mendel  and  R.  W.  McLaren.  Reinforcement  learning  control  and  pattern 
recognition systems. In J. M. Mendel and K. S. Fu, editors, Adaptive, Learning
and Pattern Recognition Systems: Theory and Applications, pages 287-318,
Academic  Press,  New  York,  1970. 

[35]  D.  Michie  and  R.  A.  Chambers.  BOXES:  An  experiment  in  adaptive  control.  In 
E.  Dale  and  D.  Michie,  editors,  Machine  Intelligence  2,  pages  137-152,  Oliver 
and  Boyd, 1968. 

[36]  K.  S.  Narendra  and  M.  A.  L.  Thathachar.  Learning  automata— A  survey.  IEEE 
Transactions  on  Systems,  Man,  and  Cybernetics,  4:323-334,  1974. 

[37]  K.  S.  Narendra  and  R.  M.  Wheeler.  An  n-player  sequential  stochastic  game 
with  identical  payoffs.  IEEE  Transactions  on  Systems,  Man,  and  Cybernetics, 
13:1154-1158,  1983. 

[38]  N.  J.  Nilsson.  Problem-Solving  Methods  in  Artificial  Intelligence.  McGraw-Hill, 
1971. 

[39] D. T. Politis and W. H. Licata. Adaptive decoder for an adaptive learning
controller.  In  Proceedings  of  SPIE  Applications  of  Artificial  Intelligence  III, 
Orlando,  FL,  1986. 

[40]  D.  L.  Reilly,  L.  N.  Cooper,  and  C.  Elbaum.  A  neural  model  for  category  learning. 
Biological  Cybernetics,  45:35-41,  1982. 




[41]  H.  Robbins.  Some  aspects  of  the  sequential  design  of  experiments.  Bulletin  of 
the  American  Mathematical  Society,  58:527-532,  1952. 

[42]  F.  Rosenblatt.  Principles  of  Neurodynamics:  Perceptrons  and  the  Theory  of 
Brain  Mechanisms.  Spartan  Books,  6411  Chillum  Place  N.W.,  Washington, 
D.C.,  1961. 

[43]  D.  E.  Rumelhart.  Personal  communication. 

[44] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations
by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
vol. 1: Foundations, Bradford Books/MIT Press, Cambridge, MA, 1986.

[45]  T.  J.  Sejnowski  and  C.  R.  Rosenberg.  NETtalk:  A  Parallel  Network  that  Learns 
to  Talk.  Technical  Report  EECS-8601,  Johns  Hopkins  University,  Department 
of  Electrical  and  Computer  Engineering,  Baltimore,  MD,  1986. 

[46] R. S. Sutton. Learning to Predict by the Method of Temporal Differences.
Technical Report, GTE-Labs, Waltham, MA, 1987.

[47] R. S. Sutton. Temporal Aspects of Credit Assignment in Reinforcement Learning.
PhD thesis, University of Massachusetts, Amherst, MA, 1984.

[48] R. S. Sutton. Two problems with backpropagation and other steepest descent
learning procedures for networks. In Proceedings of the Eighth Annual Conference
of the Cognitive Science Society, Amherst, MA, 1986.

[49]  R.  S.  Sutton  and  A.  G.  Barto.  An  adaptive  network  that  constructs  and  uses 
an internal model of its world. Cognition and Brain Theory, 3:217-246, 1981.

[50]  R.  S.  Sutton  and  B.  Pinette.  The  learning  of  world  models  by  connectionist 
networks. In Proceedings of the Seventh Annual Conference of the Cognitive
Science  Society,  Irvine,  CA,  1985. 

[51]  M.  A.  L.  Thathachar  and  P.  S.  Sastry.  Learning  optimal  discriminant  functions 
through  a  cooperative  game  of  automata.  IEEE  Transactions  on  Systems,  Man, 
and  Cybernetics,  17:73-85,  1987. 


[52]  E.  L.  Thorndike.  Animal  Intelligence.  Hafner,  Darien,  Conn.,  1911. 


[53]  D.  S.  Touretzky.  BoltzCONS:  Reconciling  connectionism  with  the  recursive 
nature  of  stacks  and  trees.  In  Proceedings  of  the  Eighth  Annual  Conference  of 
the Cognitive Science Society, Amherst, MA, 1986.

[54]  D.  S.  Touretzky  and  G.  E.  Hinton.  Symbols  among  the  neurons:  Details  of  a 
connectionist  inference  architecture.  In  Proceedings  of  the  Ninth  International 
Joint  Conference  on  Artificial  Intelligence,  Los  Angeles,  CA,  1985. 

[55]  M.  L.  Tsetlin.  Automaton  Theory  and  Modeling  of  Biological  Systems.  Academic 
Press,  New  York,  1973. 

[56]  R.  Viswanathan  and  K.  S.  Narendra.  Games  of  stochastic  automata.  IEEE 
Transactions  on  Systems,  Man,  and  Cybernetics,  4:131-135,  1974. 

[57]  M.  D.  Waltz  and  K.  S.  Fu.  A  heuristic  approach  to  reinforcement  learning 
control  systems.  IEEE  Transactions  on  Automatic  Control,  10:390-398,  1965. 

[58]  B.  Widrow,  N.  K.  Gupta,  and  S.  Maitra.  Punish/reward:  Learning  with  a 
critic  in  adaptive  threshold  systems.  IEEE  Transactions  on  Systems,  Man,  and 
Cybernetics,  5:455-465,  1973. 

[59]  B.  Widrow  and  M.  E.  Hoff.  Adaptive  switching  circuits.  In  1960  WESCON 
Convention  Record  Part  IV,  pages  96-104,  1960. 

[60]  B.  Widrow  and  F.  W.  Smith.  Pattern-recognizing  control  systems.  In  Computer 
and  Information  Sciences  (COINS)  Proceedings,  Spartan,  Washington,  D.C., 
1964. 

[61] R. J. Williams. Reinforcement Learning in Connectionist Networks: A Mathematical
Analysis. Technical Report ICS Report 8605, Institute for Cognitive
Science,  University  of  California  at  San  Diego,  La  Jolla,  CA,  1986. 

[62] R. J. Williams. Reinforcement-Learning Connectionist Systems. Technical Report
NU-CCS-87-3, College of Computer Science, Northeastern University, 360
Huntington  Avenue,  Boston,  MA,  1987. 



•  U.S. Government  Printing  Office:  1987  -  748-061/61011 


